Event processing system

ABSTRACT

User events of a platform are processed to extract aggregate information about users of the platform at an event processing system. A query relating to the user events is received at the system and at least one query parameter is determined from the query. Various privacy controls are disclosed for ensuring that any information released in response to the query cannot be used to identify users individually or to infer information about individual users.

TECHNICAL FIELD

The present invention relates to a system for processing events.

BACKGROUND

There are various contexts in which it is useful to extract aggregated and anonymized information relating to users of a platform.

For example, understanding what content audiences are publishing and consuming on social media platforms has been a goal for many for a long time. The value of social data is estimated at $1.3 trillion but most of it is untapped. Extracting the relevant information is challenging because of the vast quantity and variety of social media content that exists, and the sheer number of users on popular social media platforms, such as Facebook, Twitter, LinkedIn etc. It is made even more challenging because preserving the privacy of the social media users is of the utmost importance.

A data platform that is available today under the name DataSift PYLON connects to real-time feeds of social data from various social media platforms (data sources), uncovers insights with a sophisticated data augmentation, filtering and classification engine, and provides the data for analysis with an appropriate privacy protocol required by the data sources.

It allows insights to be drawn from posts, shares, re-shares, likes, comments, views, clicks and other social interactions across those social media platforms. A privacy-first approach is adopted to the social media data, whereby (among other things) results are exclusively provided in an aggregate and anonymized form that makes it impossible to identify any of the social media users individually.

In the context of event processing, the need arises in various contexts to analyze numbers of unique users on a platform: not only social media platforms where the events correspond to social interactions, but other types of platform with other types of user event.

SUMMARY

In order to preserve user privacy, it is desirable to impose restraints on the release of information from systems that process user events, such as events recording social interactions on a social media platform that are processed to extract aggregate information about the social interactions. The goal is to ensure that individual users can never be identified from the aggregate information released by the system and to prevent the release of any information that could be attributed to individual users.

In accordance with the present invention, user events of a platform are processed to extract aggregate information about users of the platform at an event processing system. A query relating to the user events is received at the system and at least one query parameter (filtering condition) is determined from the query.

Various novel privacy controls are disclosed herein for ensuring that any information released in response to the query cannot be used to identify users individually or be attributed to individual users. These can be used individually or in combination, with various examples described in detail below.

A first aspect of the present invention is directed to a method of processing user events of a platform to extract aggregate information about users of the platform, the method comprising, at an event processing system: receiving a query relating to the user events; determining at least one query parameter from the query; applying to the user events an exact counting procedure for computing an exact count for a set of the user events satisfying the at least one query parameter, by individually identifying and counting those user events; generating error data for the exact counting procedure; using the generated error data to introduce an artificial error in the exact counting procedure, thereby generating an approximate count for the set of user events deviating from the exact count by an unpredictable amount; and responding to the query by releasing aggregate information comprising or derived from the approximate count for the set of user events satisfying the at least one query parameter.

The inventors of the present invention have recognized that, where the aggregate information is derived using an exact counting process in which the user events are individually counted (rather than a probabilistic estimation procedure, e.g. based on HyperLogLog or one of its variants), this could open up the system to a certain type of attack in which it may be possible to identify individual users or infer information about specific users from the released information in certain circumstances. This attack is explained at length below, but for now suffice it to say that the attack is rendered ineffective by deliberately introducing the artificial error into the counting procedure and using the count with this deliberate error as a basis for responding to the query.

In embodiments, the count is a unique user count for the set of user events satisfying the at least one query parameter, or an event count for the set of user events satisfying the at least one query parameter. Alternatively, both the unique user count and the event count may be generated in this manner.

The method may comprise a step of quantizing the approximate count, the aggregate information comprising or derived from the quantized count.

For an inexact unique user count, the method may comprise a step of comparing the inexact user count to a threshold, wherein the aggregate information is released in response to determining that the inexact unique user count is no less than a minimum permitted user count indicated by the threshold.

At least one query parameter may be determined from the query for each of a plurality of buckets. The exact counting procedure may be applied for each of the buckets, with respective error data being generated unpredictably for each of the buckets individually and used to introduce an artificial error in the exact counting process for that bucket to generate an inexact bucket count for each of the buckets, whereby differing artificial errors are exhibited across the bucket counts. That is, a range of errors is ensured across the bucket counts.

The method may comprise a step of quantizing each of the inexact bucket counts for release.

An overall count across all of the buckets may also be generated, corresponding to a sum of all of the bucket counts. This can, for example, be an overall count of all events in an index, or of a subset of the events in the index that satisfy at least one overall query parameter determined from the query (i.e. with two stages of filtering, to isolate events of interest and then provide a breakdown of the results for those events). The exact counting procedure may also be applied to generate an inexact overall count for the user events in the index or the subset of the user events satisfying the at least one overall query parameter, with each of the buckets corresponding to a (further) subset thereof, such that the individual bucket counts provide a breakdown of the overall count.

The bucket counts may be unique user counts, each of which is compared with a bucket redaction threshold, wherein any of the buckets for which the inexact unique user count is below a minimum user count indicated by the bucket redaction threshold is redacted. That is, aggregate information is withheld for that bucket.
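By way of illustration only, bucket redaction can be sketched as follows. The function name, the dictionary representation of the buckets and the use of None to mark a redacted bucket are assumptions made for this sketch; the default threshold of one hundred is merely an example value.

```python
def redact_buckets(bucket_user_counts, redaction_threshold=100):
    """Withhold any bucket whose unique user count is below the threshold.

    bucket_user_counts maps a bucket label to its (inexact) unique user
    count. Redacted buckets are reported as None (an assumption of this
    sketch) so that no count is released for them.
    """
    return {
        bucket: (count if count >= redaction_threshold else None)
        for bucket, count in bucket_user_counts.items()
    }
```

A bucket whose count equals the threshold is retained, since only counts below the minimum user count are redacted.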

The overall count may be an overall unique user count and the method comprises a step of comparing the overall unique user count with a gating threshold to determine whether to accept or reject the query, wherein the query is accepted in response to determining that the overall unique user count is at least a minimum user count indicated by the gating threshold.

Each of the user events may be purged from the event processing system upon expiry of a retention period for that user event, whereby user events are not counted once purged.

User events for users below an age threshold may not be counted.

Each of the user events may comprise an identifier of one of the platform users and the count may be a unique user count generated from the user identifiers in the set of user events.

The exact counting procedure may comprise computing the exact count for the set of user events satisfying the at least one query parameter, and using the error data to modify the exact count once computed to generate the approximate count deviating from the exact count by the unpredictable amount. For example, the unpredictable amount may be unpredictably selected from a percentage range of the exact count.
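A minimal sketch of this variant is given below, assuming that the deviation is drawn uniformly from a symmetric percentage range of the exact count. The function names, the uniform distribution and the default range of 3% are assumptions made for illustration and are not prescribed above.

```python
import random


def count_matching(events, predicate):
    """Exact count: individually identify and count the matching events."""
    return sum(1 for event in events if predicate(event))


def approximate_count(exact_count, max_fraction=0.03, rng=None):
    """Modify an exact count by an unpredictable amount.

    The deviation is selected from +/- max_fraction of the exact count.
    Both max_fraction and the uniform distribution are assumptions.
    """
    rng = rng if rng is not None else random.SystemRandom()
    deviation = rng.uniform(-max_fraction, max_fraction) * exact_count
    return max(0, round(exact_count + deviation))
```

In such a sketch only the approximate count, or a value derived from it, would be released in response to the query; the exact count would remain internal to the system.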

Alternatively, the artificial error can be introduced at some other stage(s) of the process, as the count is generated.

A second aspect of the present invention is directed to a method of processing user events of a platform to extract aggregate information about users of the platform, the method comprising, at an event processing system: receiving a query relating to the user events; determining at least one query parameter from the query; computing a unique user count for a set of the user events satisfying the at least one query parameter; comparing the unique user count to a metering threshold; and rejecting the query if the unique user count exceeds a maximum permitted user count indicated by the metering threshold.

This upper limit is a simple and effective way of preventing what is referred to herein as “metering”, which refers to the use of broad queries on large, strategic populations of users to infer information about activity on the platform as a whole, such as the total number of social interactions (of any kind) or active users (of any demographic) on a social media platform within a given time period.

In embodiments, the unique user count that is compared to the metering threshold may be estimated from a representative sample of the user events in an index.

If the unique user count does not exceed the maximum permitted user count, the unique user count may be re-computed from a larger number of the user events in the index, e.g. all of the user events in the index.

The re-computed user count may be compared with a gating threshold, and in that event the query is rejected if the re-computed user count is less than a minimum permitted user count indicated by the gating threshold and accepted otherwise.
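The two-stage check described above (a metering check on a sampled estimate, followed by a gating check on a re-computed count) might be sketched as follows. The function names, the string results and the callable used for re-computation are assumptions of this sketch; how the sampled estimate is obtained is outside its scope.

```python
def metering_threshold_from_global(global_unique_users, percent):
    """Set the metering threshold as a percentage of the global user count."""
    return global_unique_users * percent // 100


def evaluate_query(sampled_user_estimate, recompute_user_count,
                   metering_threshold, gating_threshold):
    """Accept or reject a query using metering and gating thresholds.

    sampled_user_estimate: unique user count estimated from a sample.
    recompute_user_count: zero-argument callable returning the unique user
        count re-computed from a larger number of events; it is only
        called if the metering check passes.
    """
    if sampled_user_estimate > metering_threshold:
        return "rejected: exceeds metering threshold"
    full_count = recompute_user_count()
    if full_count < gating_threshold:
        return "rejected: below gating threshold"
    return "accepted"
```

Deferring the re-computation until after the metering check means that an overly broad query is rejected on the basis of the sample alone.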

The metering threshold may be set as a function of a global unique user count for the platform, e.g. as a percentage of the global unique user count for the platform.

The metering threshold may be set in dependence on a statistical analysis of the user events.

A third aspect of the present invention is directed to a method of processing user events of a platform to extract aggregate information about users of the platform, the method comprising, at an event processing system: receiving a query relating to the user events; determining at least one query parameter from the query; generating at least one count for a set of the user events satisfying the at least one query parameter; and applying quantization to the at least one count to generate at least one quantized count for release, the quantized count being one of a plurality of permitted quantized values, wherein the quantization has a variable quantization range, the quantization range being the difference between adjacent pairs of the permitted quantized values.

In embodiments, the quantization range may increase for larger permitted quantized values.

For example, the quantization range may increase “linearly” with respect to the permitted quantized values, such that quantization steps scale linearly with audience size (number of unique users). By way of example, a possible set of quantization rules might be as follows:

Audience size for the query    Quantization step on each bucket
<10,000 unique users           round down to the nearest 100
<50,000 unique users           round down to the nearest 500
<100,000 unique users          round down to the nearest 1000
<500,000 unique users          round down to the nearest 5000

Note this is just one example, and the steps may be non-linear (e.g. they may increase quadratically with audience size).
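The example rules above can be sketched as a simple lookup followed by rounding down. The names below are hypothetical, and the behaviour for audience sizes of 500,000 or more is not defined by the example rules; reusing the largest step in that case is an assumption of this sketch.

```python
# (exclusive upper bound on audience size, quantization step)
QUANTIZATION_RULES = [
    (10_000, 100),
    (50_000, 500),
    (100_000, 1_000),
    (500_000, 5_000),
]


def quantization_step(audience_size):
    """Return the quantization step for a query's audience size."""
    for upper_bound, step in QUANTIZATION_RULES:
        if audience_size < upper_bound:
            return step
    # Assumption: beyond the last rule, reuse the largest step.
    return QUANTIZATION_RULES[-1][1]


def quantize(bucket_count, audience_size):
    """Round a bucket count down to the nearest permitted value."""
    step = quantization_step(audience_size)
    return (bucket_count // step) * step
```

Because the step is chosen from the audience size for the query as a whole, every bucket of a given query is quantized with the same step.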

Multiple counts may be generated, and quantization may be applied to all of them.

A fourth aspect of the present invention is directed to a method of processing user events of a platform to extract aggregate information about users of the platform, the method comprising, at an event processing system: receiving a query relating to the user events; determining at least one query parameter from the query; computing a unique user count for a set of the user events satisfying the query parameter; setting a variable gating threshold for the query as a function of the at least one query parameter; comparing the unique user count with the gating threshold set for the query; and rejecting the query if the unique user count is less than a minimum permitted user count indicated by the gating threshold set for the query, whereby the minimum permitted user count depends on the at least one query parameter.

In embodiments, the at least one query parameter may comprise a user attribute, and the variable gating threshold may be set as a function of the user attribute.

A fifth aspect of the present invention is directed to a method of processing user events of a platform to extract aggregate information about users of the platform, the method comprising, at an event processing system: receiving a query relating to the user events; determining at least one query parameter from the query; computing, for a set of the user events satisfying the at least one query parameter, an approximate count with an error margin of at least two percent; applying quantization to the approximate count to generate a quantized count; and responding to the query by releasing aggregate information comprising or derived from the quantized count.

The approximate count (C_(A)) for the set of user events is computed with an error margin (E) of at least 2% in that it deviates from an exact count (C_(E)) for that set by an unpredictable amount (D), wherein D conforms to a probability distribution having a standard deviation (σ) and the error margin E is defined as follows:

$E = {\frac{\sigma}{C_{E}} \geq {2\%}}$

That is:

Pr(D=d)=f(d)

where Pr(D=d) is the probability that D=d and f(d) is the probability distribution having standard deviation σ that is at least 2% of the exact count C_(E).

In the context of obtaining counts for user events, this is a significantly larger error than might be expected. However, the inventors of the present invention have recognized that, when combined with quantization, an error of this magnitude ensures that the information released by the system is anonymized. In particular, this combination provides robust protection against a form of “set-balancing attack” that is set out in detail below. To achieve this, an error margin E of about 2-3% is generally expected to be sufficient.

The error can be an error that is artificially introduced into an exact counting procedure, or it can be an error that is intrinsic to a probabilistic count estimation procedure, providing that steps are taken to ensure that this error is of sufficient magnitude. That is, the error margin may be a consequence of introducing an artificial error into an exact counting procedure applied to the user events or it may be intrinsic to a probabilistic count estimation procedure applied to the user events to generate the count.

In embodiments, the error margin may be at least three percent.

The quantization may have a quantization range of at least one hundred (for example, the count may be quantized by rounding it down to the nearest one hundred).
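The combination of an error margin E = σ/C_(E) of at least 2% with quantization can be illustrated by the following sketch, which introduces an artificial error into an exact count. The choice of a zero-mean normal distribution for the deviation D is an assumption made purely for illustration (only the standard deviation is constrained above), and the function and parameter names are hypothetical.

```python
import random


def noisy_quantized_count(exact_count, error_margin=0.02, step=100, rng=None):
    """Perturb an exact count and round the result down to a multiple of step.

    The deviation D is drawn from a normal distribution (an assumption)
    with standard deviation sigma = error_margin * exact_count, so that
    sigma / exact_count equals the chosen error margin.
    """
    rng = rng if rng is not None else random.Random()
    sigma = error_margin * exact_count
    approximate = exact_count + rng.gauss(0.0, sigma)
    return max(0, int(approximate // step) * step)
```

Only the quantized value is returned, so the deviation applied to any individual query is not directly visible in the result.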

Note that, in relation to the first and fifth aspects in particular, the deviation is “unpredictable” in that it would not be possible for an external observer to predict what deviation would be introduced for a new given query from the results of different queries submitted to the system, so that the deviation appears random to the external observer. Accordingly, a deviation is considered unpredictable not only when it is indeterministic but also when it is deterministic but unpredictable to users of the system who lack knowledge of how it is created (that is, who lack knowledge of the process used to create it and/or the underlying data to which that process has been applied). For the avoidance of doubt it is noted that the terms “random” and “randomized” are used interchangeably with the term “unpredictable”, and as such are not limited to indeterministic behaviour but also encompass behaviour that is deterministic but appears random in this sense.

For example, in accordance with the first aspect of the invention, the error data can be generated in a truly-random sense (e.g. based on quantum-mechanical phenomena, in which case the deviation is unpredictable even with knowledge of how it is determined) but also in a pseudo-random sense, in which case the deviation is unpredictable, and appears random, without knowledge of how it is generated. For example, where the deviation is generated pseudo-randomly by applying an algorithm to a seed, it is unpredictable and therefore appears random without knowledge of the algorithm and the seed. (It is also noted, for the avoidance of doubt, that the extent of the deviation that has been introduced for any given query may not even be derivable from the information released from the system due to the quantization).
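One possible way of obtaining a deviation that is deterministic yet appears random to an observer lacking the seed is sketched below. The use of HMAC-SHA256 keyed with a secret value, and the derivation of the deviation from the query text, are assumptions of this sketch rather than features described above; all names are hypothetical.

```python
import hashlib
import hmac


def deterministic_deviation(secret_key, query_text, exact_count,
                            max_fraction=0.03):
    """Derive a repeatable deviation from a secret key and the query text.

    Without knowledge of secret_key (playing the role of the seed) the
    deviation cannot be predicted, although it is fully deterministic.
    """
    digest = hmac.new(secret_key, query_text.encode("utf-8"),
                      hashlib.sha256).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    unit = int.from_bytes(digest[:8], "big") / 2 ** 64
    # Rescale to a fraction in [-max_fraction, +max_fraction).
    fraction = (2 * unit - 1) * max_fraction
    return round(fraction * exact_count)
```

In this sketch a repeated, identical query receives the same deviation each time.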

In any of the above, the platform may be a content publication platform for publishing and consuming content, the user events relating to the publication and consumption of content by the users of the content publishing platform.

However, it is noted that whilst the “user events” referred to herein can relate to social interactions on a social media platform (publishing/consuming content), the invention is not limited to this and the system can be used for processing other types of events. The platform can be any platform with a user base that facilitates user actions. The platform provider could for example be a telecoms operator like Vodafone or Verizon, a car-hire/ride-share platform like Uber, an online market place like Amazon, or a platform for managing medical records. The events can for example be records of calls, car rides, financial transactions, changes to medical records etc. conducted, arranged or performed via the platform. There are numerous scenarios in which it is beneficial to extract anonymous and aggregated information from such events, where the need to obtain a count, such as a unique user count or event count, over a set of such events arises.

In this respect, it is noted that all description pertaining to interaction events of a social media platform (content items) herein applies equally to other types of events of platforms other than social media. Each user event can be any event relating to the user with which it is associated. Each of the user events may relate to an action performed by or otherwise relating to one of the users of the platform and comprise an identifier of that user. That is, each of the user events may be a record of a user-related action on the platform.

Such events can comprise or be associated with user attributes and/or metadata for the actions to which they relate, allowing those events to be processed (e.g. filtered and/or aggregated) using any of the techniques described herein, for example to count the number of events satisfying a filter (e.g. at least one query parameter defined in or derived from a query) and the number of unique users across those events.

Another aspect of the present invention is directed to an event processing system comprising computer storage holding executable instructions and one or more processing units configured to execute those instructions to carry out any of the method steps or system functionality disclosed herein.

Another aspect of the present invention is directed to a computer program product comprising executable instructions stored on a computer readable storage medium and configured, when executed at an event processing system, to carry out any of the method steps or system functionality disclosed herein.

BRIEF DESCRIPTION OF FIGURES

For a better understanding of the present invention, and to show how embodiments of the same may be carried into effect, reference is made by way of example to the following figures in which:

FIG. 1A shows a schematic block diagram of an index builder of a content processing system;

FIG. 1B shows a schematic block diagram of a real-time filtering and aggregation component of a content processing system;

FIG. 2 shows a schematic block diagram of a computer system in which a content processing system can be implemented;

FIG. 3 shows a block diagram of a content processing system in accordance with the present invention;

FIG. 4 shows an example of a user-centred for a content processing system;

FIG. 5 shows a flowchart for a method of filtering and counting events in an index and FIG. 5A shows an example of the method applied to certain events;

FIG. 6 shows a schematic illustration of an ordered data structure;

FIG. 7 shows a high level illustration of certain privacy controls;

FIG. 7A shows an example of minimum audience-size gating;

FIG. 7B shows an example of quantization and redaction; and

FIG. 8 illustrates an example of a metering prevention technique.

DETAILED DESCRIPTION OF EMBODIMENTS

FIG. 1A shows a high level overview of part of a content processing system for processing content items 604 of a social media platform.

Each of the content items 604 (also called “interaction events” or simply “events” herein) is a record of an “interaction” on the social media platform (social interaction), which can be a social media user publishing a new piece of content or consuming an existing piece of content. Examples of different publishing or consuming actions are given later. The events are provided by the social media platform, which is referred to as a “data provider” in this context. They are provided as a real-time data stream or multiple real-time data streams (e.g. different streams for different types of events), also referred to as “firehoses” herein. That is, the events 604 are received in real-time at an index builder 600 of the content processing system as the corresponding social interactions take place.

Indexes, such as index 602, can be created within the index builder 600. An index is a database in which selectively-made copies of the events 604 are stored for processing. An index can for example be a bespoke database created by a querying user for his own use, i.e. a user of the content processing system who wishes to submit queries to it (such as a customer), or it can be a shared index created by an operator of the content processing system for use by multiple customers. The index 602 holds copies of selected events 604, which are selected by a filtering component 608 of the index builder 600 according to specified filtering rules. These filtering rules are defined in what is referred to herein as an “interaction filter” 606 for the index 602. Viewed in slightly different terms, an index can be seen as a partial copy of a global database (the global database being the set of all events received from the data provider) that is populated by creating copies of the events 604 that match the interaction filter 606.

The index 602 can be created in a “recording” process, which is initialized by providing an interaction filter 606 and which runs from a timing of the initialization to capture events from that point onwards as they occur in real-time. It may also be possible for an index to contain historical events. The interaction filter 606 is applied by the filtering component 608 in order to capture events matching the interaction filter 606, from the firehoses, as those events become available. The process is a real-time process in the sense that it takes as an input the “live” firehoses from the data provider and captures the matching events in real-time as new social interactions occur on the social media platform. The recording process continues to run until the customer (in the case of a bespoke index) or service provider (in the case of a shared index) chooses to suspend it, or it may be suspended automatically in some cases, for example when system limits imposed on the customer are breached.

Each of the events 604 comprises a user identifier of the social media user who has performed the corresponding interaction. As explained in further detail later, by the time the events 604 arrive at the filtering component 608, preferably every one of the events comprises a copy of the content to which it relates; certain “raw” events, i.e. as provided by the data provider, may not include the actual content when first provided, in which case this can be obtained and added in an “augmentation” stage of the content processing system, in which “context building” is performed.

User attributes of the social media users are made available by the data provider from user data of the social media platform, for example from the social media users' social media accounts (in a privacy-sensitive manner; see below). A distinguishing characteristic of such user attributes is that they are self-declared, i.e. the social media users have declared those attributes themselves (in contrast to user attributes that need to be inferred from, say, the content itself). The attributes may be provided separately from the raw events representing the publication and consumption of content from the data provider. For example, an attribute firehose may be provided that conveys the creation or modification of social media profiles in real-time. In that case, as part of the context building, the events 604 relating to the publication and consumption of content can be augmented with user attributes from the attribute firehose, such that each of the augmented events 604 comprises a copy of a set of user attributes for the social media user who has performed the interaction.

The idea behind context building is to add context to events that lack it in some respect. For example, a user identifier (ID) in an incoming event may simply be an anonymized token (to preserve user privacy) that has no meaning in isolation; by adding user attributes association. In database terminology, context building can be viewed as a form of de-normalization (vertical joining). Another example is when a data provider provides a separate firehose of “likes” or other engagements with previous events.

The customer or service provider is not limited to simply setting the parameters of his interaction filter 606; he is also free to set rules by which the filtered events are classified, by a classification component 612 of the index builder 600. That is, the customer/service provider has the option to create a classifier 610 defining classification rules for generating and attaching metadata to the events before they are stored in the index 602. These classification rules can, for example, be default or library rules provided via an API of the content processing system, or they can be rules which the customer or service provider codes himself for a particular application.

Individual pieces of metadata attached to the events 604 are referred to herein as “tags”. Tags can include for example topic indicators, sentiment indicators (e.g. indicating positive, negative or neutral sentiment towards a certain topic), numerical scores etc., which the customer or service provider is free to define as desired. They could for example be rules based on simple keyword classification (e.g. classifying certain keywords as relating to certain topics or expressing positive sentiment when they appear in a piece of content; or attributing positive scores to certain keywords and negative scores to other keywords and setting a rule to combine the individual scores across a piece of content to give an overall score) or using more advanced machine learning processing, for example natural language recognition to recognize sentiments, intents etc. expressed in natural language or image recognition to recognize certain brands, items etc. in image data of the content. The process of adding metadata tags to events, derived from the content to which they relate, is referred to as “enrichment” below.
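A simple keyword-based classification rule of the kind mentioned above might be sketched as follows. The keyword lists, the event field names ("content", "tags") and the rule of summing individual keyword scores are all hypothetical choices made for this sketch.

```python
import re

# Hypothetical classification rules (assumptions for illustration).
TOPIC_KEYWORDS = {
    "cars": {"engine", "sedan", "hatchback"},
    "food": {"recipe", "pizza", "dessert"},
}
KEYWORD_SCORES = {"love": 1, "great": 1, "hate": -1, "awful": -1}


def enrich(event):
    """Return a copy of the event with topic and score tags attached."""
    words = re.findall(r"[a-z']+", event["content"].lower())
    topics = sorted(
        topic for topic, keywords in TOPIC_KEYWORDS.items()
        if keywords.intersection(words)
    )
    # Combine the individual keyword scores into an overall score.
    score = sum(KEYWORD_SCORES.get(word, 0) for word in words)
    return {**event, "tags": {"topics": topics, "sentiment_score": score}}
```

The original event is left unmodified, and the tags are attached to a copy of the kind that would be stored in an index.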

In addition to bespoke tags added through enrichment, the events may already have some tags when they are received in the firehoses, for example time stamps indicating timings of the corresponding interactions, geolocation data etc.

With the (additional) tags attached to them in this manner according to the customer's bespoke definitions, the filtered and enriched events are stored in the index 602, populating it over time as more and more events matching the interaction filter 606 are received.

Multiple indexes can be created in this manner, tailored to different applications in whatever manner the service provider/customers desire.

It is important to note that, in the case of private social media data in particular, even when the customer has created the index 602 using his own rules, and it is held in the content processing system on his behalf, he is never permitted direct access to it. Rather, he is only permitted to run controlled queries on the index 602, which return aggregate information, derived from its contents, relating to the publication and/or consumption of content on the content publication platform. The aggregate information released by the content processing system is anonymized, i.e. formulated and released in a way that makes it impossible to identify individual social media users. This is achieved in part in the way the information is compiled based on interaction and unique user counts (see below) and in part by redacting information relating to only a small number of users (e.g. fewer than one hundred).

Queries are discussed in greater detail below but for now suffice it to say that two fundamental building blocks for the anonymized aggregate information are:

1) interaction counts, and
2) associated unique user counts.

These counts can be generated either for the index 602 as a whole or (in the majority of cases) for a defined subset of the events in the index 602, isolated by performing further filtering of the events held in the index 602 according to “query filters” as they are referred to herein. Taken together, these convey the number of interactions per unique user for the (sub)set of events in question, which is a powerful measure of overall user behaviour for the (sub)set of events in question.

The interaction count is simply the number of events in the index 602 or subset, and the unique user count is the number of unique users across those events. That is, for a query on the whole index 602, these are the number of events that satisfy (match) the index's interaction filter 606 and the number of unique social media users who collectively performed the corresponding interactions; for a query on a subset of the index 602 defined by a query filter(s), they are the number of events that also match that query filter(s) (e.g. 606 a, 606 b, FIG. 1B; see below) and the number of unique social media users who collectively performed the corresponding subset of interactions. Successive query filters can be applied, for example, to isolate a particular user demographic or a particular set of topics and then break down those results into “buckets”. Note, this does not mean successive queries have to be submitted necessarily; a single query can request a breakdown or breakdowns of results, and the layers of filtering needed to provide this breakdown can all be performed in response to that query. For example, results for a demographic defined in terms of gender and country could be broken down as a time series (each bucket being a time interval), or in a frequency distribution according to gender, most popular topics etc. These results can be rendered graphically on a user interface, such as a dashboard, in an intuitive manner. This is described in greater detail later.
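As a minimal illustration of these two counts, the following sketch computes an interaction count and a unique user count for the events matching a query filter. Representing events as dictionaries with a "user_id" field, and representing the query filter as a predicate function, are assumptions made for this sketch.

```python
def interaction_and_user_counts(events, query_filter=lambda event: True):
    """Return (interaction count, unique user count) for matching events.

    With the default filter, the counts cover the whole set of events.
    """
    matching = [event for event in events if query_filter(event)]
    unique_users = {event["user_id"] for event in matching}
    return len(matching), len(unique_users)
```

Dividing the first value by the second gives the number of interactions per unique user for the (sub)set in question.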

For example, to aggregate by gender (one of “Male”, “Female”, “Unknown”) and age range (one of “18-25”, “25-35”, “35-45”, “45-55”, “55+”), unique user and interaction counts may be generated, in the response to an aggregation query, for each of the following buckets:

Bucket
Male, 18-25
Male, 25-35
Male, 35-45
Male, 45-55
Male, 55+
Female, 18-25
Female, 25-35
Female, 35-45
. . .
Unknown, 55+
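By way of illustration only, the per-bucket counts described above can be sketched as follows. The event field names (`user`, `gender`, `age_range`) and the function name `bucket_counts` are assumptions introduced for this sketch and do not form part of the described system.

```python
from collections import defaultdict

def bucket_counts(events):
    """Return {(gender, age_range): (interaction_count, unique_user_count)}."""
    interactions = defaultdict(int)
    users = defaultdict(set)
    for event in events:
        bucket = (event["gender"], event["age_range"])
        interactions[bucket] += 1         # every event is one interaction
        users[bucket].add(event["user"])  # distinct users seen in this bucket
    return {b: (interactions[b], len(users[b])) for b in interactions}

# Hypothetical events; "user" stands for an anonymized identifier.
events = [
    {"user": "t1", "gender": "Male", "age_range": "18-25"},
    {"user": "t1", "gender": "Male", "age_range": "18-25"},
    {"user": "t2", "gender": "Male", "age_range": "18-25"},
    {"user": "t3", "gender": "Female", "age_range": "25-35"},
]
counts = bucket_counts(events)
```

This simple version keeps a set of user identifiers per bucket, which is the memory cost that the user-centred architecture described later is intended to avoid.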

Despite their simplicity, these fundamental building blocks are extremely powerful, particularly when coupled with the user attributes and bespoke metadata tags in the enriched events in the index 602. For example, by generating interaction and user counts for different subsets of events in the index 602, which are isolated by filtering according to different combinations of user attributes and tags, it is possible for an external customer to extract extremely rich information about, say, the specific likes and dislikes of highly targeted user demographics (based on the social interactions exhibited across those demographics) or the most popular topics across the index or subset thereof, without ever having to permit the external customer direct access to the index 602 itself.

For example, a useful concept when it comes to identifying trends within particular user demographics is the concept of “over-indexing”. This is the notion that a particular demographic is exhibiting more interactions of a certain type than average. This is very useful when it comes to isolating behaviour that is actually specific to a particular demographic. For example, it might be that within a demographic, a certain topic is seeing a markedly larger number of interactions per unique user than other topics (suggesting that users are publishing or consuming content relating to that topic more frequently). However, it might simply be that this is a very popular topic, and that other demographics are also seeing similar numbers of interactions per unique user. As such, this conveys nothing specific about the target demographic itself. However, where, say, a topic is over-indexing for a target user demographic, i.e. seeing a greater number of interactions per unique user across the target demographic than the number of interactions per unique user across a wider demographic, then that conveys information that is specific to the target demographic in question.
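The comparison described above can be expressed numerically. The following is a minimal sketch, assuming that over-indexing is measured as a simple ratio of interactions per unique user in the target demographic to the same measure in a wider demographic; the ratio form, the function names and the example figures are all assumptions made for illustration.

```python
def interactions_per_user(interaction_count, unique_user_count):
    """Interactions per unique user for one (sub)set of events."""
    return interaction_count / unique_user_count

def over_index(target, baseline):
    """Ratio of the target demographic's rate to the wider demographic's rate.

    Both arguments are (interaction_count, unique_user_count) pairs.
    A value above 1.0 suggests the topic is over-indexing for the target.
    """
    return interactions_per_user(*target) / interactions_per_user(*baseline)

# Hypothetical counts for one topic: 3.0 interactions per user in the
# target demographic against 2.0 across the wider demographic.
ratio = over_index(target=(300, 100), baseline=(2000, 1000))
```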

By way of example, FIG. 1B shows a real-time filtering and aggregation component 652 of the content processing system implementing steps to respond to a query with two stages of filtering to give a breakdown in response to that query.

In the first stage of filtering 654 a, a first query filter 626 a is applied to the index 602 (shown as one of multiple indexes) to isolate a subset of events 656 that match the first query filter 626 a. The first query filter 626 a can for example be defined explicitly in the query by the customer, in order to isolate a particular demographic(s) of users or a particular topic(s) (or a combination of both) that is of interest to that customer.

In the second stage of filtering 654 b, second query filters 626 b (bucket filters) are applied to the subset of events 656. Each of the bucket filters is applied to isolate the events in the subset 656 that satisfy that bucket filter, i.e. the events in a corresponding bucket, so that total interaction and user counts can be computed for that bucket. The total user and interaction counts for each bucket (labelled 656.1-4 for buckets 1-4 in this example) are included, along with total user and interaction counts for the subset of events 656 as a whole, in a set of results 660 returned in response to the query. The results 660 are shown rendered in a graphical form on a user interface, which is a dashboard 654. That is, the results 660 are represented as graphical information displayed on a display to the customer. The underlying set of results 660 can also be provided to the customer, for example in a JSON format, so that the customer can apply further processing to them easily.
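The two stages of filtering can be sketched as below. This is an illustrative outline only: filters are modelled as plain Python predicates, and the names `run_query` and `count` and the event fields are assumptions, not the actual interface of the filtering and aggregation component.

```python
def count(events):
    """(interaction count, unique user count) for a set of events."""
    return len(events), len({e["user"] for e in events})

def run_query(index, query_filter, bucket_filters):
    # First stage: isolate the subset of events matching the query filter.
    subset = [e for e in index if query_filter(e)]
    result = {"total": count(subset), "buckets": {}}
    # Second stage: apply each bucket filter to the subset, not the whole index.
    for name, bucket_filter in bucket_filters.items():
        result["buckets"][name] = count([e for e in subset if bucket_filter(e)])
    return result

index = [
    {"user": "t1", "country": "UK", "topic": "cars"},
    {"user": "t1", "country": "UK", "topic": "music"},
    {"user": "t2", "country": "UK", "topic": "cars"},
    {"user": "t3", "country": "US", "topic": "cars"},
]
result = run_query(
    index,
    query_filter=lambda e: e["country"] == "UK",
    bucket_filters={
        "cars": lambda e: e["topic"] == "cars",
        "music": lambda e: e["topic"] == "music",
    },
)
```

The returned dictionary holds the counts for the subset as a whole alongside the counts for each bucket, mirroring the structure of the set of results described above.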

Multiple subsets can be isolated in this way at the first stage filtering 626 a, and each can be broken down into buckets as desired at the second stage 626 b.

The buckets can for example be time based, i.e. with each bucket containing events in the subset 656 within a different time interval. These are shown rendered on the dashboard 654 as a graphical time series 655 a, with time along the x-axis and the counts or a measure derived from the counts (such as number of interactions per unique user) on the y-axis, which is a convenient and intuitive way of representing the breakdown according to time. As another example, the buckets could be topic based (e.g. to provide a breakdown of the most popular topics in the subset 656) or user based (e.g. to provide a breakdown according to age, gender, location, job function etc.), or a combination of both. In this case, it may be convenient to represent the results as a frequency distribution or histogram 655 b, to allow easy comparison between the counts or a measure derived from the counts (e.g. interactions per user) for different buckets. As will be appreciated, these are just examples, and it is possible to represent the results for the different buckets in different ways that may be more convenient in some contexts. The information for each bucket can be displayed alongside the equivalent information for the subset 656 as a whole for comparison, for example by displaying on the dashboard 654 the total user and interaction counts or the total number of interactions per unique user across the subset 656 as a whole etc. The dashboard 654 can for example be provided as part of a Web interface accessible to the customer via the Internet.

FIG. 2 shows a schematic block diagram of a computer system in which various devices are connected to a computer network 102 such as the Internet. These include user devices 104 connected to the network 102 and which are operated by users 106 of a social media platform.

The term “social media platform” refers herein to a content publication platform, such as a social network, that allows the social media users 106 to interact socially via the social media platform, by publishing content for consumption by other social media users 106, and consuming content that other social media users 106 have published. A social media platform can have a very large number of users 106 who are socially interacting in this manner: tens of thousands or more, with the largest social media platforms today having user bases approaching 2 billion users. The published content can have a variety of formats, with text, image and video data being some of the most common forms. A piece of published content can be “public” in the sense that it is accessible to any user 106 of the social media platform (in some cases an account within the social media platform may be needed, and in others it may be accessible to any Web user), or it can be “private” where it is rendered accessible to only a limited subset of the social media users 106, such as the sharing user's friends. That is, private content is rendered accessible to only a limited audience selected by the user publishing it. Friendships and other relationships between the users 106 of the social media platform can be embodied in a social graph of the social media platform, which is a computer-implemented data structure representing those relationships in a computer readable format. Typically, a social media platform can be accessed from a variety of different user devices 104, such as smart phones, tablets and other smart devices, or other general purpose computing devices such as laptop or desktop computers. This can be via a web browser or alternatively a dedicated application (app) for the social media platform in question. Examples of social media platforms include LinkedIn, Facebook, Twitter, Tumblr etc.

Social media users 106 can publish content on the social media platform by generating new content on the platform such as status updates, posts etc., or by publishing links to external content, such as articles etc. They can consume pieces of content published by other social media users 106 for example by liking, re-sharing, commenting on, clicking on or otherwise engaging with that content, or simply having that content displayed to them without actively engaging with it, for example in a news feed etc. (that is, displaying a piece of content to a social media user is considered a consuming act in itself in some contexts, for which an interaction event is created, as it is assumed the user has seen the displayed content). That is, the term “consumption” can cover both active consumption, where it is evident the user has made a deliberate choice to consume a specific piece of content, and passive consumption, where all that is known is that a specific piece of content has been rendered available to a user and it is assumed the user has consumed it.

To implement the social media system, a back-end infrastructure in the form of at least one data centre is provided. By way of example FIG. 2 shows first and second data centres 108 a, 108 b connected to the network 102; however, as will be appreciated, this is just an example. Large social media systems in particular may be implemented by a large number of data centres geographically distributed throughout the world. Each of the data centres 108 a, 108 b is shown to comprise a plurality of servers 110. Each of the servers 110 is a physical computing device comprising at least one processing unit 112 (e.g. CPU), and electronic storage 114 (memory) accessible thereto. An individual server 110 can comprise multiple processing units 112; for example around fifty. An individual data centre can contain tens, hundreds or even thousands of such servers 110 in order to provide the very significant processing and memory resources required to handle the large number of social interactions between the social media users 106 via the social media platform. In order to publish new content and consume existing content, the user devices 104 communicate with the data centres 108 a, 108 b via the network 102. Within each of the data centres 108 a, 108 b, data can be communicated between different servers 110 via an internal network infrastructure of that data centre (not shown). Communication between different data centres 108 a, 108 b, where necessary, can take place via the network 102 or via a dedicated backbone 116 connecting the data centres directly. Those skilled in the art will be familiar with the technology of social media and its possible implementations so further details of this will not be described herein.

The frequent and varied social interactions between a potentially very large number of social media users 106 contain a vast array of information that is valuable in many different contexts. However, processing that content to extract information that is meaningful and relevant to a particular query presents various challenges.

The described embodiments of the present invention provide a content processing system which processes events of the kind described above in order to respond to queries from querying users 120 with targeted information relevant to those queries, in the manner outlined above. The querying users 120 operate computer devices 118 at which they can generate such queries and submit them to the content processing system.

A data processing system 200 comprising the content processing system 202 will now be described with reference to FIG. 3, which is a schematic block diagram of the system 200.

The content processing system 202 is shown to comprise a content manager 204, an attribute manager 206, a content processing component 208 and a query handler 210. The content manager 204, attribute manager 206, content processing component 208 and query handler 210 of the content processing system 202 are functional components, representing different high level functions implemented within the content processing system 202.

At the hardware level, the content processing system 202 can be implemented in the data centres 108 a, 108 b of the social media system back end itself (or in at least one of those data centres), that is, by content processing code modules stored in the electronic storage 114 and executed on the processing units 112. Computer readable instructions of the content processing code modules are fetched from the electronic storage 114 by the processing units 112 for execution on the processing units 112 so as to carry out the functionality of the content processing system 202 described herein. Implementing the content processing system 202 in the social media data centres 108 a, 108 b themselves is generally more efficient, and also provides a greater level of privacy and security for the social media users 106, as will become apparent in view of the following. However, it is also viable to implement it in a separate data centre (particularly when only public content is being processed) that receives a firehose(s) from the social media platform via the Internet 102.

As explained below, the content manager 204 and attribute manager 206 form part of a privatization stage 210 a of the content processing system 202. They co-operate so as to provide an internal layer of privacy for social media users by removing all user-identity from the events and user attributes before they are passed to the content processing component 208. The content processing component 208 and query handler 210 constitute a content processing stage 210 b of the content processing system 202, at which events and attributes are processed without ever having access to the users' underlying identities in the social media platform. This privatization is particularly important for private content.

The steps taken to remove the user-identity can be seen as a form of anonymization. However, for the avoidance of doubt, it is noted that removing the user-identity does not fully anonymize the events 212 or user data, as it may still be possible to identify individual users through careful analysis based on their attributes and behaviour. For this reason, the anonymized events and user data are never released by the content processing system 202, and the additional anonymization steps outlined above are taken on top of the removal of the user identity to ensure that individual users can never be identified from the aggregate information released by the system 202.

To implement the privatization, the content manager 204 receives events 212 of the social media platform where, as noted, each of the events 212 represents a social interaction that has occurred on the social media platform and comprises a user identifier 214 of one of the social media users 106 who performed that interaction. That is, the user who published or consumed the piece of content to which the event relates. The user identifiers 214 in the events 212 constitute public identities of the social media users 106. For example, these can be user names, handles or other identifiers that are visible or otherwise accessible to other social media users 106 who can access the published content in question. As part of the privatization stage 210 a, the content manager modifies the events 212 to replace the public identifiers 214 with corresponding anonymized user identifiers 224 in the modified events 222, which can for example be randomly generated tokens. Within the content processing stage 210 b, the anonymized tokens 224 act as substitutes for the public identifiers 214. The content manager 204 replaces the public identifiers 214 with the anonymous tokens 224 in a consistent fashion, such that there is a one-to-one relationship between the public identifiers 214 and the corresponding tokens 224. However, the public identifiers 214 themselves are not rendered accessible to the content processing stage 210 b at any point.
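A minimal sketch of the consistent replacement described above is given below. It assumes an in-memory mapping and tokens drawn from Python's `secrets` module; the class name `Tokenizer`, the field names `public_id` and `user`, and the 128-bit token length are illustrative assumptions rather than details of the content manager.

```python
import secrets

class Tokenizer:
    """Maps each public identifier to one randomly generated token."""

    def __init__(self):
        # public identifier -> anonymized token; held only in the
        # privatization stage and never passed downstream.
        self._tokens = {}

    def token_for(self, public_id):
        if public_id not in self._tokens:
            # 128 random bits, so distinct users collide only with
            # negligible probability.
            self._tokens[public_id] = secrets.token_hex(16)
        return self._tokens[public_id]

    def privatize(self, event):
        """Return a modified copy of the event with the public id replaced."""
        modified = dict(event)
        modified["user"] = self.token_for(modified.pop("public_id"))
        return modified

tokenizer = Tokenizer()
e1 = tokenizer.privatize({"public_id": "@alice", "action": "like"})
e2 = tokenizer.privatize({"public_id": "@alice", "action": "share"})
e3 = tokenizer.privatize({"public_id": "@bob", "action": "like"})
```

Because the same public identifier always yields the same token, the events of one user remain linkable downstream even though the identifier itself is absent from the modified events.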

Beyond the fact that these anonymized identifiers 224 allow each user's events to be linked together, these anonymized tokens 224 do not convey any information about the identity of the social media users 106 themselves.

As such, an important function of the attribute manager 206 is one of generating what are referred to herein as “anonymized user descriptions” 240. Each anonymized user description 240 comprises a set of attributes for one of the social media users 106 and is associated with the anonymized user identifier 224 for that user. In the example of FIG. 3B, each of the anonymized user descriptions 240 comprises a copy of the anonymized user identifier 224 and is provided to the content processing component 208 separately from the modified events 222. This in turn allows the content processing component 208 to link individual events 222 with the attributes for the user in question by matching the anonymized tokens in the anonymized user descriptions 240 to those in the events 222, and augmenting those events with those attributes. The user descriptions 240 can be updated as the user attributes change, or as new user information becomes available, for incorporation in subsequent events. Alternatively, the user attributes could instead be provided to the content processing component 208 as part of the events 222 themselves.
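The matching of tokens between user descriptions and events can be sketched as a simple lookup. The record layout (`token`, `attributes`) and the function name `augment` are assumptions introduced for this illustration.

```python
def augment(events, descriptions):
    """Attach user attributes to each event by matching anonymized tokens."""
    by_token = {d["token"]: d["attributes"] for d in descriptions}
    augmented = []
    for event in events:
        # Copy the attributes so each augmented event carries its own set;
        # events with no matching description receive an empty set.
        attributes = dict(by_token.get(event["token"], {}))
        augmented.append({**event, "attributes": attributes})
    return augmented

descriptions = [
    {"token": "a1f3", "attributes": {"gender": "Female", "age_range": "25-35"}},
]
events = [
    {"token": "a1f3", "action": "like"},
    {"token": "9c2e", "action": "share"},  # no description received yet
]
augmented = augment(events, descriptions)
```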

The attribute manager 206 can determine the user attributes 226 for the anonymized user descriptions 240 from user data 242 of the social media system itself, for example the user data that forms part of the social media users' accounts within the social media system. The social media user data 242 can for example comprise basic demographic information such as gender, age etc. From this, the attribute manager 206 can determine basic user attributes such as gender attributes, age (or age range) attributes etc.

User attributes determined from the user data 242 of the social media system itself are referred to herein as a first type of user attribute or, equivalently, “native” attributes (being native to the social media platform itself). The attribute manager 206 may also be able to determine user attributes of other types in certain circumstances, from other sources of data.

The query handler 210 handles incoming queries submitted to the content processing system 202 by the querying users 120. These queries are essentially requests for aggregate information relating to the publication and/or consumption of content within the social media system. As noted, this may involve applying a querying filter(s) where, in general, a querying filter can be defined in terms of any desired combination of user attributes 226 and/or tags. The content processing component 208 filters the events 222 to filter out any events that do not match the querying filter.

The basic elements of a query essentially fall into one of two categories: elements that specify user demographics (in terms of user attributes); and elements that specify particular content (in terms of tags). For the former, the aim is to filter out events 222 for users outside of the desired demographic (filtering by user attribute). For the latter, the aim is to filter out events that are not relevant to the specific tags (filtering by metadata).

For example, for a query defined in terms of one or more user attributes and one or more tags (see above), the content processing component 208 filters out any events 222 for users without those attributes and any events 222 that do not match those tags, leaving only the events for users having those attributes and which also match those tags. From the filtered events (i.e. the remaining events) the content processing component 208 can extract the desired aggregate and anonymized information.

As will be appreciated, this is a relatively simple example presented for the purposes of illustration and it is of course possible to build more complex queries and to return results with more detailed information. For example, a general query for any popular topics for a specified demographic of users (as defined by a set of attributes) may return as a result one or more popular topics together with a number of unique users in that demographic who have been engaging with that topic. As another example, a general query requesting information about which demographics a specified topic is popular with may return a set of user attributes and a number of unique users having those attributes and who have engaged with that topic recently. Here, the concept mentioned above of over-indexing becomes pertinent: for example, the response to the query may identify demographics (in terms of attributes) for which the topic is over-indexing, i.e. indicating that this topic is not merely popular within that demographic but more popular than the average across all demographics (or at least a wider demographic).

As noted, certain types of tag, such as topic, can be generated by processing the pieces of published content 216 themselves, for example using natural language processing in the case of text and image recognition in the case of static images or video. This enrichment can be performed before or after the user-identities have been stripped out (or both).

Queries submitted to the content processing system 202 are handled and responded to in real time, where real time in this particular context means that there is only a short delay of two seconds or less between the query being received at the content processing system 202 and the content processing system 202 returning a result. The filtering needed to respond to the query is performed by the content processing component 208 in response to the submission of the query itself. That is, the content processing component 208 performs the filtering in real-time when the query is received. Any pre-processing or enrichment of the events need not be performed in real time, and can for example be performed as the events are received at the relevant part of the system.

Once the events 222 have been filtered as needed to respond to the query in question, the content processing component 208 extracts, from the filtered events in real-time, anonymized, aggregate information about social interactions on the social media platform. That is, aggregate information about the publication and/or consumption of content by the social media users 106.

As will be apparent, new events 212 will be constantly generated as the content processing system 202 is in use. For example, for popular social media platforms, hundreds of thousands of new events may be generated every minute as users frequently publish new content or consume existing content. To handle the large volume of data, the resulting anonymized events 222 are only retained at the anonymized content processing stage 210 b for a limited interval of time, for example 30 days or so. In that case, the result returned in response to a query relates to activity within the social media platform within that time interval only.

Alternatively, rather than a blanket retention rule of this nature, the amount of time for which events 222 are retained may be dependent on the events themselves. For example events relating to more popular content may be retained for longer. This allows older information for more popular content to be released upon request.
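An event-dependent retention rule of this kind might look like the following sketch. Only the 30-day base interval is taken from the description; the `engagement` field, the popularity threshold and the additional 60 days for popular content are hypothetical values chosen purely for illustration.

```python
from datetime import datetime, timedelta

BASE_RETENTION = timedelta(days=30)   # base interval from the description
POPULAR_BONUS = timedelta(days=60)    # hypothetical extension
POPULAR_THRESHOLD = 1000              # hypothetical popularity cut-off

def retention_for(event):
    """How long an event is kept; popular content is kept for longer."""
    if event["engagement"] >= POPULAR_THRESHOLD:
        return BASE_RETENTION + POPULAR_BONUS
    return BASE_RETENTION

def purge(events, now):
    """Return only the events still within their retention interval."""
    return [e for e in events if now - e["time"] <= retention_for(e)]

now = datetime(2024, 3, 1)
events = [
    {"id": 1, "time": datetime(2024, 2, 20), "engagement": 5},    # 10 days old
    {"id": 2, "time": datetime(2024, 1, 1), "engagement": 5},     # 60 days old
    {"id": 3, "time": datetime(2024, 1, 1), "engagement": 5000},  # 60 days old, popular
]
kept = [e["id"] for e in purge(events, now)]
```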

FIG. 3 also shows details of the content processing component 208 in one embodiment of the present invention. The content processing component is shown to comprise an augmentation component 272, which receives the events 222 and the user descriptions 240. These can for example be received in separate firehoses. The augmentation component augments the events 222 with the user attributes 226. That is, for every one of the events 222, the augmentation component adds, to that event 222, a copy of the user attributes associated with the user identifier in that event 222. The augmented events 223 are passed to an index builder 274, which corresponds to the index builder 600 in FIG. 1A and operates as described above to create indexes 278 populated with selected and enriched ones of the augmented events 223. The indexes 278 are rendered accessible to a real-time filtering and aggregation component 276 of the content processing component 208, which operates as described above with reference to FIG. 1B in order to filter and aggregate events in the index in real-time as and when it is instructed to do so by the query handler 210. The indexes 278 and filtering and aggregation component 276 are also shown in FIG. 3A. Events 223 are purged from the indexes 278 in accordance with the retention policy.

As indicated above, whilst the privatization stage 210 a is particularly important for private content, it is not essential, and can in particular be omitted for public content in some contexts. In that case, the above techniques can be applied to the original events 212 directly, using the public identifiers 214 in place of the anonymized identifiers 224.

An architecture for the content processing system 202 that is preferred in some contexts is described below with reference to FIG. 4, but first some of the considerations that led to this architecture are explained.

There are various contexts in which it is desirable to compute a number of unique users across a set of data selected for a specific query. For example, for a query requesting a breakdown into multiple buckets, a unique user count is computed for each bucket as described above.

In most indexing systems that are currently used to process large amounts of time-dependent data, events are organised by time. For example, events may be “sharded” (horizontally partitioned) on multiple nodes, such as database servers, according to time. This is a natural way of sharding time dependent events, such as the events in the incoming firehoses from a data provider.

With time-centred sharding, a memory-efficient way of obtaining an approximate unique user count across many nodes is to use a probabilistic data structure, such as HyperLogLog++. In addition to being space-efficient, the HyperLogLog (HLL) data structure has a convenient property that an object computed on one node can be merged with an object being computed on another node, and the resulting HLL object still gives a good approximation of the number of distinct users across the two (or more) nodes; that is, the error rate remains low and isn't compounded by merging multiple objects together.
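The merge property can be illustrated with a deliberately simplified sketch of plain HyperLogLog (not HyperLogLog++, and without the small-range and bias corrections used in production implementations). The register count, the use of SHA-1 as the hash and all function names are assumptions made for this example.

```python
import hashlib

P = 10        # index bits (assumed); gives M = 1024 registers
M = 1 << P

def _hash64(value):
    """64-bit hash of a string, taken from the first 8 bytes of SHA-1."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def hll_sketch(users):
    """Build the register array for a collection of user identifiers."""
    registers = [0] * M
    for user in users:
        h = _hash64(user)
        index = h >> (64 - P)                    # first P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)         # remaining 54 bits
        rank = (64 - P) - rest.bit_length() + 1  # position of leftmost 1-bit
        registers[index] = max(registers[index], rank)
    return registers

def hll_merge(a, b):
    """Merging two sketches is an element-wise maximum of their registers."""
    return [max(x, y) for x, y in zip(a, b)]

def hll_estimate(registers):
    """Raw HyperLogLog estimate (no small-range or bias correction)."""
    alpha = 0.7213 / (1 + 1.079 / M)
    return alpha * M * M / sum(2.0 ** -r for r in registers)

# Two "nodes" that have seen overlapping sets of users.
node_a_users = {f"user{i}" for i in range(0, 3000)}
node_b_users = {f"user{i}" for i in range(2000, 5000)}
merged = hll_merge(hll_sketch(node_a_users), hll_sketch(node_b_users))
```

Because each register holds a maximum, the merged registers are identical to those that a single node would have computed over the union of the two user sets, which is why merging does not compound the error.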

In the present context, even in a single query, the need can arise to retrieve information that is broken down into potentially thousands of buckets. As an example, it may be desirable to filter events in the index 278 on people in a certain age range and country, and return the counts of people broken down into buckets by their industry by applying additional filtering on top of this, for example by their job function, by the top articles shared by each industry job function etc. With HLL, generating a response to a single query of this nature might need thousands of HLL objects to be generated. Moreover, a single user can run multiple queries at once, and on top of that there can be multiple users using the system simultaneously, which means the overall memory requirements to handle HLL objects on each node for all querying users (and transfer them over the network to merge them for each query) become very significant.

In contrast to the time-centred indexing systems of the kind described above, the content processing system 202 of FIG. 4 has a user-centred indexing architecture that allows a count of unique users across a set of events that match a filter to be determined very quickly and with reduced memory requirements. The count is obtained with greater efficiency than HLL, both in terms of memory usage and speed of computation. In the main embodiment described below, HLL objects are not utilised, and in contrast to HLL, an exact count is produced. However, an alternative embodiment is described wherein HLL objects are utilised and the count is an estimate.

This increased efficiency is achieved by changing the way the data is laid out on disk and distributed on the various nodes of a cluster.

Rather than grouping events according to time, they are grouped according to user: the idea being to allocate all of the events for a certain user to the same node, for processing on that node, and to store them in local storage at that node contiguously, i.e. so that events for different users are never intertwined in the local storage. This way, to compute the unique number of users for a certain query, it is sufficient to simply count the boundaries between different users who satisfy an applied filter, without any requirement to store a record of all the user IDs encountered so far (which would otherwise be needed to keep track of which users have already been counted). Moreover, because all the data for a given user is sent to the same node, it is also safe to sum the unique counts across all nodes, without any risk of users being “double-counted” (i.e. the same user being counted more than once).
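The boundary-counting idea can be sketched in a few lines. The sketch assumes each node's events are already stored contiguously by user, as described above; the function name `count_unique_users`, the event fields and the example data are illustrative assumptions.

```python
def count_unique_users(grouped_events, matches):
    """Count users having at least one event that satisfies the filter.

    Assumes grouped_events is stored contiguously by user, so only the
    most recently counted user needs to be remembered, not a set of ids.
    """
    count = 0
    last_counted = None
    for event in grouped_events:
        if matches(event) and event["user"] != last_counted:
            count += 1
            last_counted = event["user"]
    return count

# Each user's events live on exactly one node, grouped together.
node_a = [
    {"user": "t1", "topic": "cars"},
    {"user": "t1", "topic": "music"},
    {"user": "t1", "topic": "cars"},
    {"user": "t2", "topic": "music"},
]
node_b = [
    {"user": "t3", "topic": "cars"},
    {"user": "t4", "topic": "cars"},
]

def is_cars(event):
    return event["topic"] == "cars"

# Per-node counts can simply be summed: no user appears on two nodes.
total = sum(count_unique_users(node, is_cars) for node in (node_a, node_b))
```

The memory used during the count is constant regardless of the number of users, which is the contrast with both a set of seen identifiers and a per-bucket HLL object.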

As noted above, the unique user count is one of the fundamental pieces of information that may be released by the content processing system 202, and the present techniques allow it to be extracted quickly for prompt release in response to a query. The unique user count is also highly pertinent to privacy policy aspects: as noted, to maintain privacy for the social media users 106, it is desirable to impose privacy policies which constrain when such information can be released. In particular, information for any given bucket may only be released when the total number of unique users across that bucket exceeds a minimum threshold, and is redacted if the threshold is not met. That is, for any given query, the privacy policies need to be satisfied by each one of the corresponding buckets. With the present techniques, user counts can be generated extremely quickly for each bucket, allowing a very fast check to be performed on each bucket of whether it is necessary to apply redactions to that bucket. Accordingly, with this user-centred architecture, queries can be responded to, and strong privacy constraints can be guaranteed, with greater efficiency than HLL.
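A per-bucket threshold check of this kind can be sketched as follows. The threshold value of 100, the use of `None` as a redaction marker and the function name are assumptions for illustration; the description states only that a bucket is released when its unique user count exceeds a minimum threshold.

```python
def apply_privacy_threshold(bucket_results, min_users):
    """Release a bucket only if its unique user count exceeds min_users.

    bucket_results maps bucket name -> (interaction_count, unique_user_count).
    Redacted buckets are represented here by None (an assumption).
    """
    released = {}
    for bucket, (interactions, users) in bucket_results.items():
        if users > min_users:
            released[bucket] = (interactions, users)
        else:
            released[bucket] = None  # redacted
    return released

bucket_results = {
    "Male, 18-25": (5400, 1200),
    "Female, 18-25": (310, 100),  # exactly at the threshold: not released
    "Unknown, 55+": (12, 7),
}
released = apply_privacy_threshold(bucket_results, min_users=100)
```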

Maintaining data grouped by user can be achieved by routing the data by user to a designated node for a range of users, and implementing a “staged” merging at each node, in which new data (new events) is temporarily “parked” (i.e. temporarily stored) in a “staging area” (queue) of that node, and implementing a periodic process to sort the new data and merge it with the existing events already at that node.
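The routing and staged merging can be outlined as below. This is a sketch under stated assumptions: users are assigned to nodes by a CRC32 hash modulo the node count (standing in for the "range of users" assignment, whose details are not given here), events are kept ordered by user identifier to guarantee contiguity, and the class and function names are hypothetical.

```python
import heapq
import zlib

def node_for(user, node_count):
    """Deterministically pick the node responsible for a user (assumed scheme)."""
    return zlib.crc32(user.encode("utf-8")) % node_count

def _by_user(event):
    return event["user"]

class Node:
    def __init__(self):
        self.events = []   # existing events, ordered (and so grouped) by user
        self.staging = []  # newly arrived events parked until the next merge

    def receive(self, event):
        """Park a new event in the staging area without touching the store."""
        self.staging.append(event)

    def merge_staged(self):
        """Periodic step: sort the staged events and merge them into the store."""
        self.staging.sort(key=_by_user)
        self.events = list(heapq.merge(self.events, self.staging, key=_by_user))
        self.staging = []

node = Node()
for user in ["t2", "t1", "t2"]:
    node.receive({"user": user})
node.merge_staged()
for user in ["t1", "t3", "t2"]:
    node.receive({"user": user})
node.merge_staged()
users_in_store = [e["user"] for e in node.events]
```

Merging two already sorted runs is linear in their combined length, so the periodic step avoids re-sorting the events that are already in place.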

FIG. 4 shows a schematic block diagram of the content processing system 202 having the user-centred architecture, which is shown to comprise a plurality of processing nodes 302 (“content processors”). The nodes 302 are processing devices, each comprising at least one and preferably multiple processing units such as CPUs. That is, the processing nodes 302 are computer devices, such as servers. A processing unit can be for example a single-core processor or an individual core of a multi-core processor. In any event, each of the processing units is a physical unit that can execute content processing code, by fetching instructions of the content processing code from memory accessible to that processing unit and carrying out those instructions, simultaneously with the other processing units in the same content processing server to carry out parallel processing within that server. Moreover, each of the nodes 302 can perform such processing simultaneously with the other nodes 302 to perform parallel processing, across the nodes 302, of incoming events 316 received at the content processing system 202 in the firehose(s), such as the anonymized events 222 of FIG. 3.

Eight individual nodes 302 a-h are shown in FIG. 4, but in practice it is expected that more processing nodes may be used to handle the large amount of content published and consumed on popular social media platforms. The nodes 302 cooperate to implement the real-time filtering and aggregation component 276 of FIG. 3, in order to filter and count events efficiently.

The nodes 302 of FIG. 4 are servers located in a data centre, and can for example be a set of the servers 110 in one of the data centres 108 a, 108 b of the social media platform itself. The data centre has an internal network 312, provided by an internal, high-speed network infrastructure of the data centre, via which the nodes 302 can communicate with other components of the content processing system 202.

Each of the nodes 302 has access to its own local computer storage (thatis, local to that node), in which it can store and modify data to carryout its content processing functions, labelled 303 a-h for nodes 302 a-hrespectively. This can comprise volatile and/or non-volatile memory, forexample solid-state or magnetic storage (or a combination of both). Theterm “disk” is sometimes used as short-hand for the local storage at anode, though it will be appreciated that this term does not necessarilyimply traditional rotating-disk storage and also covers solid-statestorage (for example).

The nodes 302 may also have access to shared computer storage 314; that is, shared between two or more of the nodes 302 and accessible via the internal network 312 of the data centre. This can be located in the data centre itself, or it may be external and accessed via an external connection connected to the internal network 312 (or a combination of both).

As well as the content processing code executed on the nodes 302 ofcontent processing system 202, control code is also executed within thecontent processing system 202. The control code coordinates the contentprocessing across the nodes 302, via the internal network 312, to ensurethat it is conducted in an efficient and reliable manner. This can beexecuted at one of the nodes 302 or at a separate computer device of thedata centre, or at multiple such devices in a distributed fashionprovided it is in a manner that permits it overall visibility andcontrol of the nodes 302.

In this respect, the content processing system 202 is shown to comprise a content allocator 304, a filter coordinator 306 and a total count generator 308. These are functional components of the content processing system 202, representing different high-level functions implemented by the control code when executed within the content processing system 202. The components 304-308 are shown connected to the internal network 312, to represent the fact that the signalling within the data centre needed to coordinate these functions takes place via the internal network 312.

As indicated, a key feature of the content processing system 202 of FIG.4 is that incoming events 316 representing social interactions on thesocial media platform—that is, social media users publishing new contentor consuming existing content—are allocated to individual nodes 302 forprocessing based on the user IDs in those events, such that all eventswith the matching user IDs (i.e. corresponding to the same social mediauser) are allocated to the same node. That is, each event representingan interaction is allocated based on the (anonymized) identity of theuser who performed that interaction. That is, each unique useridentifier for the social media platform is assigned to exactly one ofthe nodes 302, and that one node is responsible for filtering andaggregating all events associated with that unique identifier.

Note that, in the present context, a user identifier is assigned toexactly one node in the sense that all events for that user(corresponding to all of the social interactions performed by that user)are allocated to that node, and that node alone is responsible forfiltering and aggregating those events (to generate local interactionand user counts—see below) in carrying out the real-time filtering andaggregation functionality of the system (that is, component 276 in FIG.3). It does not exclude the possibility of making additional copies ofthe events and/or applying additional processing to those eventselsewhere in the system for other purposes, for example as part of abackup function.

The users can be assigned to nodes by using user IDs according to an ordering function described below. However, other techniques are possible; for example, the distribution strategy of users onto different nodes might be based on a hash of the user ID, rather than the user ID itself. Moreover, there is no implicit ordering requirement between different groups of users, besides the fact that the users are grouped. Most importantly, the events for any particular user are always processed at the same node.

It can be convenient for each node to be assigned a range of user identifiers within the set of existing user identifiers U. In mathematical terms, an ordered set of user identifiers can be defined with respect to the set U and a defined order relation ‘<’ wherein, for all u1, u2≠u1∈U, either u1<u2 or u2<u1. Viewed in these terms, the nodes 302 are assigned ranges of user IDs:

-   node 302 a: [u0, u1]
-   node 302 b: [u1+1, u2]
-   node 302 c: [u2+1, u3]
-   etc.,

where [a,b] denotes the subset a, b and all values in U in between a and b with respect to the order relation ‘<’. The definition of the order relation is arbitrary provided it is applied consistently within the content processing system 202, though it can be convenient for it to reflect typical mathematical conventions such as the ordering of the natural numbers or the alphabet in the case of alphanumeric characters or hexadecimals etc. An example is illustrated in FIG. 4, showing ranges of events 316 f, 316 g allocated to nodes 302 f, 302 g respectively.

Note that, whilst it is convenient to allocate ranges of events in this manner, it is not essential. What is important is that all events for the same user ID are allocated to the same node 302.
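By way of illustration only, the range-based allocation and the hash-based alternative might be sketched as follows. This is a minimal sketch under stated assumptions: the integer user IDs, the range bounds and both function names are hypothetical and are not taken from the system described.

```python
import bisect
import hashlib

# Hypothetical inclusive upper bounds (u1, u2, u3) of the ranges held by
# nodes 0, 1 and 2; every user ID up to 999 is assumed to belong to node 0.
RANGE_UPPER_BOUNDS = [999, 1999, 2999]


def node_for_user_by_range(user_id: int) -> int:
    """Return the index of the node whose range contains user_id."""
    idx = bisect.bisect_left(RANGE_UPPER_BOUNDS, user_id)
    if idx == len(RANGE_UPPER_BOUNDS):
        raise ValueError(f"user ID {user_id} is outside every configured range")
    return idx


def node_for_user_by_hash(user_id: str, num_nodes: int) -> int:
    """Alternative: derive the node from a stable hash of the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes
```

Either function is deterministic, so every event carrying a given user ID is routed to the same node, which is the only property the allocation needs to guarantee.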

At each of the nodes 302 a-302 h, the events allocated to that node are stored on disk at that node (i.e. in local storage 303 a-h respectively) and are grouped therein according to user ID, i.e. in contiguous groups. That is, they are stored in a data structure having one group for each user ID assigned to that node, with all events associated with that user ID (and only those events) stored in that group.

As explained below, this grouping in the local storage allows an extremely quick and efficient process, in which filtering and counting of the events is performed simultaneously to generate a unique user count for that filter.

Particular advantages are gained where events for a user are held inphysically contiguous storage locations, particularly in a systemworking with billions of events. There may be a need to ‘freeze’ anindex and periodically create a new one when enough new data arrives.Contiguous physical storage locations of the events in the local storagereduce fragmentation on the disk when new indexes are created and allowfaster access to the local storage in performing the filtering andcounting procedure. However the grouping could be a logical groupingonly, not necessarily reflected at the physical level but it is expectedthat there will be a performance benefit to physical grouping.

By way of example, FIG. 4 shows, for events 316 f allocated to node 302 f, groups grp1, grp2, grp3, . . . , grpN−1, grpN for user IDs usr4692, usr4693, usr4696, . . . , usr7929, usr7929A respectively.

A convenient way of implementing this grouping is by simply sorting the events in the local storage by user ID with respect to the order relation ‘<’, as in the example of FIG. 4. That is, the range of events allocated to each of the nodes 302 can be stored in an ordered data structure in the local storage, ordered according to the user identifiers in the events, where each group is then just the range of events for that group's user ID in that context. Regarding the ordered data structure, with reference to FIG. 6, a simple list data structure 500 is shown, in which each event 504 a, 504 b, 504 c is stored, at a respective storage location, in association with a respective pointer Pa, Pb, Pc to the storage location at which another of the events is stored, i.e. the next event in the list, or some other indicator indicating the next event on the list (and all description below applies equally to other such indicators). As will be appreciated, this is a simplified example for the purposes of illustration and more complex ordered data structures can be used depending on the context. When a new event is allocated to that node, it can be stored at any available storage location, and the pointers updated to preserve the ordering. For example, FIG. 6 shows how, upon assignment of a new event 504 d for user ID “usr4692” to node 302 f, the pointer Pb for event 504 b (also usr4692) that previously pointed to the first event for “usr4693” (504 c), can be replaced with a pointer Pb′ to the location at which the new event 504 d is stored in the local storage 303 f; and it is now the new event 504 d that is stored in association with a pointer to event 504 c. It is not necessary to change the physical storage locations, but a choice may be made to do so as an optimization to prevent excessive fragmentation on the disk.
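The pointer update described above can be illustrated with a singly linked list. The following is a simplified sketch only: the class and function names are hypothetical, and an in-memory list stands in for storage locations on disk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EventNode:
    user_id: str
    label: str
    next_event: Optional["EventNode"] = None  # pointer to the next event in the ordering


def insert_event(head: Optional[EventNode], new: EventNode) -> EventNode:
    """Link `new` after the last event whose user ID is <= new.user_id; return the head."""
    if head is None or new.user_id < head.user_id:
        new.next_event = head
        return new
    current = head
    while current.next_event is not None and current.next_event.user_id <= new.user_id:
        current = current.next_event
    # Only two pointers change; no existing event is moved.
    new.next_event = current.next_event
    current.next_event = new
    return head


def labels_in_order(head: Optional[EventNode]) -> List[str]:
    """Walk the list and return the event labels in list order."""
    out = []
    while head is not None:
        out.append(head.label)
        head = head.next_event
    return out
```

With events a and b for “usr4692” followed by event c for “usr4693”, inserting a new event d for “usr4692” re-points b at d and d at c, mirroring the replacement of pointer Pb by Pb′.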

Note that sorting the events on disk in this way is also not essential: the fundamental efficiency saving stems from the grouping of events according to user ID (in any order), with one group per unique user ID. Sorting is simply a convenient and efficient way of implementing the grouping.

The addition of a new event can be performed as part of the staged merge mentioned above. Newly allocated events are temporarily held in a queue of the node 302, and then periodically merged with the existing events at that node in batches (optionally with some physical-level defragmentation of the events on the disk at each merge). Alternatively, new events can be added in real-time as they are allocated (possibly with some periodic defragmentation).
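A staged merge of this kind might be sketched as follows. This is an illustrative sketch only: the `NodeStore` class, its method names and the `(user_id, event_id)` record are assumptions, and in-memory lists stand in for the node's local storage.

```python
import heapq
from typing import List, Tuple

Event = Tuple[str, int]  # hypothetical minimal record: (user_id, event_id)


class NodeStore:
    """Holds the sorted events of one node plus an unsorted staging queue."""

    def __init__(self) -> None:
        self.events: List[Event] = []   # existing events, sorted by user ID
        self.staging: List[Event] = []  # newly allocated events, not yet merged

    def allocate(self, event: Event) -> None:
        """Park a new event in the staging area."""
        self.staging.append(event)

    def merge_staged(self) -> None:
        """Periodic step: sort the staged batch and merge it with the existing events."""
        batch = sorted(self.staging)
        self.events = list(heapq.merge(self.events, batch))
        self.staging.clear()
```

Because both inputs to the merge are already sorted, the merge is a single linear pass, and events for the same user remain contiguous afterwards.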

In this manner, within the content processing system 202 as a whole, events are grouped precisely according to user ID across the nodes 302.

To put it another way, the nodes 302 implement a form of distributeddatabase (distributed within the data centre) in which the events 316are stored and the database is (horizontally) partitioned across thenodes 302 according to user IDs into global (horizontal) partitions,also called “shards”, with one shard per node 302. The shard boundariesare chosen so that they never divide events for the same user, i.e. withall events for the same user in the same shard, such that the shardboundaries always coincide with group boundaries. As explained in moredetail below, this allows the shards to be processed in parallel by thenodes 302, to generate a local unique user count for each of the shards(i.e. at each node 302), and the local user counts across all of theshards (i.e. all of the nodes 302) can simply be summed to generate atotal user count with no risk of double-counting.
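The reason the local counts can be summed safely can be seen in a small worked example. This is an illustrative sketch; the event tuples and the function name are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[str, int]  # hypothetical record: (user_id, timestamp)

events: List[Event] = [("u1", 1), ("u2", 2), ("u1", 3), ("u3", 4)]


def summed_local_user_counts(shards: List[List[Event]]) -> int:
    """Count unique users within each shard independently, then sum the counts."""
    return sum(len({user_id for user_id, _ in shard}) for shard in shards)


# Sharding by user: all events for a given user fall in exactly one shard.
shards_by_user = [
    [e for e in events if e[0] == "u1"],
    [e for e in events if e[0] in ("u2", "u3")],
]

# Sharding by time: user u1 appears in both shards.
shards_by_time = [events[:2], events[2:]]

true_unique_users = len({user_id for user_id, _ in events})
```

Here the user-based shards sum to the true unique user count, whereas the time-based shards over-count because u1 is counted once in each shard.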

As indicated, incoming streams of time-dependent events are not normally partitioned in this way based on user ID; it is more usual to horizontally partition such events according to time.

In addition to this “global” partitioning across the nodes 302, locally, at each node 302, the shard can be further partitioned into (horizontal) sub-partitions for processing at that node, again such that the sub-partitions do not divide groups of events for the same user, i.e. with all events for the same user in the same sub-partition of that shard such that the sub-partition boundaries also always coincide with group boundaries. As will be apparent in view of the following, this allows the sub-partitions at each of the nodes 302 to be processed in parallel to generate respective local unique user counts for those sub-partitions at that node, which can then simply be summed to generate the local user count for the node, i.e. for the shard as a whole.

For example, the shard may be sub-partitioned with one sub-partition per CPU. That is, within each of the nodes 302, each of the user identifiers assigned to that node 302 (server) can be assigned to exactly one of the CPUs in that server, where events with that ID are processed by that same CPU. That is, each group of events at each server is assigned to one CPU responsible for processing those events.

An alternative is to sub-partition between processing threads. Which oneis best depends on the architectural context. The basic idea is to avoidinefficiencies due to CPUs having to context switch between manythreads. If the main bottleneck is CPU utilization, having a CPUdedicated to each data partition is a good way of maximizing local CPUcaches and overall processing throughput. If the main bottleneck ismemory, network or other resources, then it may be efficient to havemore partitions than CPU cores, because the overhead of contextswitching can be compensated by making better use of CPUs in idle times.It is noted that, a CPU core or thread can be a content processor as theterm is used herein, with the present techniques being applied across,say, the CPU cores or threads of a server.

Each of the nodes 302 can access the metadata and user attributes in each of the events in its ordered data structure. Either the content itself, or a tokenised version of the content (see later), is held with the metadata. By way of example, FIG. 4 shows metadata 317 f, 317 g and sets of user attributes 318 f, 318 g for the user identifiers in the events 316 f, 316 g respectively.

The filter coordinator 306 can instruct all of the nodes 302 to filter their allocated events according to a desired filter, where a key aim is to obtain a count of the total number of unique users across all of the nodes 302 who satisfy the desired filter. The filter can be defined in a query submitted to the system, and in this context the filter coordinator 306 and total count generator 308 can be considered part of the query handler 210. In the context of the system described above, the filter can be an interaction filter or a query filter, for example the query filter used to generate results for a given bucket. As noted, multiple (possibly thousands of) query filters may be applied to respond to a single query, with counts being generated for each.

A filter can be defined in terms of user attributes or event metadata, such as metadata tags (see above), or a combination of both.

For a filter defined in terms of a set of one or more user attributes A1, events with matching user attributes satisfy that filter. For a filter defined in terms of a set of one or more tags T1, events with matching tags satisfy that filter.

Before discussing the filtering operations, there follows a description of how content can be stored. Content items can be held in their entirety, for example if a particular field (e.g. title) might be needed verbatim as a value in a distribution query (discussed later). Alternatively, indexed tokens only may be held, with the original content discarded, as explained below.

In order to make it efficient to filter a large corpus of documents by certain keywords, a standard indexing technique (called “inverted indexing”) can be used, where each document is tokenised into individual terms, and then a lookup table is created for each term in the corpus dictionary, each pointing to a list of the document IDs where that term appears. For example, take a simplified example of two documents:

Doc001: the brown fox jumps over the lazy dog
Doc002: the dog jumps over the sofa

Inverted Index:

term     location (Doc#:position_in_the_doc)
the      Doc001:1, Doc001:6, Doc002:1, Doc002:5
brown    Doc001:2
fox      Doc001:3
jumps    Doc001:4, Doc002:3
over     Doc001:5, Doc002:4
lazy     Doc001:7
dog      Doc001:8, Doc002:2
sofa     Doc002:6

This way, if there is a requirement to know which documents mentioned the word “dog”, all that is needed is a single lookup against the inverted index, instead of having to scan every single document in the corpus each time.
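The two-document example can be reproduced with a few lines of code. This is a minimal sketch assuming whitespace tokenisation and 1-based positions; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, List[Tuple[str, int]]]:
    """Map each term to the (document ID, position) pairs at which it appears."""
    index: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].append((doc_id, position))
    return index


docs = {
    "Doc001": "the brown fox jumps over the lazy dog",
    "Doc002": "the dog jumps over the sofa",
}
index = build_inverted_index(docs)

# A single lookup answers "which documents mention 'dog'?"
docs_with_dog = sorted({doc_id for doc_id, _ in index["dog"]})
```

The lookup touches only the entry for “dog”; neither document is rescanned.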

If returning the original document verbatim is not a requirement of the system, storing the “tokenised” inverted index only is enough for filtering and returning the correct document IDs. In this case, given that the original document is discarded after indexing, it is necessary to keep a copy if a re-index operation is needed.

A key flexibility of the content processing system 202 is that filtering can be performed on any desired combination of attributes A1 and tags T1. That is, a filter F(T1,A1) is able to isolate content items based on the content itself (based on the one or more tags T1), the users who have published or consumed that content (based on the one or more attributes A1), or a combination of both. For example, for a simple filter F=[T1 AND A1], any of the events which:

-   matches (all of) those tag(s) T1 and
-   is associated with (all of) those attribute(s) A1

matches that filter (and no other events). However, it is also possible to define more complex filters, using combinations of logical operators, such as AND, OR and NOT.
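Filters of this kind compose naturally as predicates over events. The following is an illustrative sketch only: the `Event` record and the helper names `has_tags`, `has_attributes`, `all_of`, `any_of` and `negate` are hypothetical and do not reflect any particular query language.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Set

@dataclass(frozen=True)
class Event:
    user_id: str
    tags: FrozenSet[str]        # metadata tags describing the content
    attributes: FrozenSet[str]  # attributes of the user who performed the interaction


Filter = Callable[[Event], bool]


def has_tags(required: Set[str]) -> Filter:
    """Match events carrying all of the required tags (T1)."""
    return lambda e: required <= e.tags


def has_attributes(required: Set[str]) -> Filter:
    """Match events whose user has all of the required attributes (A1)."""
    return lambda e: required <= e.attributes


def all_of(*filters: Filter) -> Filter:   # logical AND
    return lambda e: all(f(e) for f in filters)


def any_of(*filters: Filter) -> Filter:   # logical OR
    return lambda e: any(f(e) for f in filters)


def negate(f: Filter) -> Filter:          # logical NOT
    return lambda e: not f(e)
```

A filter F=[T1 AND A1] is then `all_of(has_tags(T1), has_attributes(A1))`, and more complex filters are built by nesting the three combinators.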

Such filters constitute powerful building blocks for complex queries, allowing rich information about the specific type of content (defined in terms of tags T1) being shared between users in specific demographics (defined in terms of attributes A1) to be extracted from the filtered events.

Regarding terminology, the term “filtered events” (or similar) refers to the subset of events that do satisfy the filter. That is, the remaining events after all of the other events have been filtered out. A user is said to satisfy a filter when there is at least one event for that user currently assigned to one of the nodes 302 which satisfies that filter.

The aim is to determine:

-   1) the total number of unique events, across all of the nodes 302, that satisfy that filter F(T1,A1); and
-   2) the total number of unique users, across all of the nodes 302, that satisfy that filter.

As noted above, these two pieces of information are powerful building blocks when it comes to analysing the available data.

This information is embodied in a total interaction count and a total user count respectively, generated by the total count generator 308 in the manner that will be described below.

With regards to 2), a key advantage of uniquely assigning each user identifier to one node only, and allocating all events for that user to that node for filtering and aggregation, is that, when instructed to apply any given filter, each node can simply count the number of unique users who satisfy that filter for the subset of events allocated to it, to generate a local user count. Because there is no overlap in the users or events between different nodes, the local user counts across all of the nodes 302 can simply be summed together to generate the total user count, with no risk of users being double-counted. Moreover, by grouping the events according to user at each node, this count can be derived in an efficient boundary counting procedure that filters and counts events simultaneously.

With reference to FIGS. 5 and 5A, a process of generating a local user count for a target partition will now be described, in which events are simultaneously filtered and selectively counted (i.e. only counted if they satisfy the filter). FIG. 5 shows a flow chart for the process and FIG. 5A is a graphic illustrating certain principles of the process by way of example.

The process can be applied, in parallel, to each shard as a whole, at the node 302 to which that shard is allocated, to generate a local user count for that shard. However, where the processing nodes 302 have multiple processing units, it is preferably applied to each sub-partition of that shard in parallel, to generate respective local user counts for those sub-partitions, which can then be summed to generate the local user count for that shard. Accordingly, the target partition to which the process is applied can be a shard, or a sub-partition of a shard.

The process begins at step S202, by the filter coordinator 306 instructing all of the nodes 302 to apply a filter F(T1,A1).

A boundary count variable (boundaryCount) is used to count total unique users who satisfy the filter, by counting boundaries between any groups of events that satisfy the filter, which is possible because they are grouped according to user ID. An interaction count variable (interactionCount) counts the total number of interactions that satisfy the filter, by simply counting all events that satisfy the filter.

FIG. 5A illustrates the boundary counting principle applied to theevents 316 f and 316 g of nodes 302 f and 302 g respectively for afilter F=[T1 AND A1]. Events that satisfy the filter [T1 AND A1]—thatis, whose user has (all of) the attribute(s) A1 and which matches (allof) the tag(s) T1—are shown bold on the right-hand-side, andseparated-out on the left-hand side from the remaining events that donot; the latter are also greyed out on the left-hand side, and includeevents whose users have attributes A1 but do not match tags T1 and viceversa. As shown, there may be multiple events for a particular user thatsatisfy the filter; however, whilst this is certainly germane to theinteraction count, it should not affect the unique user count. Rather,for the latter, what matters is the number of boundaries (shown as thickblack lines) between the remaining groups of events. These boundariescan be counted very efficiently when the events are grouped (e.g.sorted) according to user ID, and the final boundary counts acrossmultiple partitions can simply be summed with no risk ofdouble-counting.

Returning to FIG. 5, boundaryCount and interactionCount are initializedto zero (S204) and, for a first of the event groups (206)—that is, afirst of the user IDs assigned to the target partition—the events inthat group are processed in turn to determine whether any of thoseevents match the filter F (S208). That is, starting with a first of theevents in that group, the process checks each event one-by-one todetermine whether it matches the filter F, in a boundary countingprocedure S208 that is performed for the current group until all eventsin that group have been checked.

The boundary counting procedure S208 proceeds for the current group as follows, commencing with the first event in the current group.

At step S210, the current event in the current group is processed to determine whether it matches the filter F. If not, the method proceeds to step S216, where if the end of the group has not yet been reached, i.e. there are still event(s) in the current group to be checked, the boundary counting procedure S208 continues for the next event in the current group (218), commencing with step S210 for that event.

However, if the current event does match the filter F at step S210, the method branches depending on whether this is the first time an event matching the filter F has been found in the current group (S210):

-   if so, boundaryCount is incremented by one (S212) and interactionCount is incremented by one (S214). Although shown in that order, steps S212-214 can be performed in reverse order or in parallel. Both counters are incremented because the matching event not only represents another interaction matching the filter F but also one performed by a user who has not been counted yet;
-   if not, then the method branches straight to step S214 to increment interactionCount without incrementing boundaryCount. This is because, although that event constitutes a unique interaction, it is one performed by a user who has already been counted.

From step S214, with one or both counters incremented, the process proceeds to step S216, and from there as described above.

When the end of the current group is reached (S216, “Y” branch), themethod proceeds to step S220, where if there is still at least one moregroup to be checked (that is, one or more users in the partition whohave not been checked), the boundary counting procedure S208 is repeatedas described above for the next group (222) in the partition, i.e. thenext user, commencing at step S210 for the first event in the nextgroup—noting that, now that the process has moved on to the next user,it is necessary to increment boundaryCount once more (S212) shouldanother event satisfying the filter F be located.

Once every group in the partition has been checked in full, the method terminates at step S224 (S220, “N” branch). As will be apparent, the values of boundaryCount and interactionCount at this point now reflect, respectively, the total number of unique users and total number of events that satisfy the filter F, for the partition that has just been processed.
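The process of FIG. 5 can be summarised in code as a single pass over the grouped events. This is a sketch under stated assumptions: events are `(user_id, payload)` tuples already grouped by user ID, and the function and parameter names are hypothetical.

```python
from itertools import groupby
from typing import Callable, Iterable, Tuple

Event = Tuple[str, str]  # hypothetical record: (user_id, payload)


def count_partition(events: Iterable[Event],
                    matches: Callable[[Event], bool]) -> Tuple[int, int]:
    """Return (boundaryCount, interactionCount) for one partition.

    Assumes all events for the same user ID are contiguous in `events`.
    """
    boundary_count = 0      # unique users with at least one matching event
    interaction_count = 0   # all matching events
    for _user_id, group in groupby(events, key=lambda e: e[0]):
        user_counted = False
        for event in group:
            if not matches(event):
                continue
            if not user_counted:        # first match in this group
                boundary_count += 1
                user_counted = True
            interaction_count += 1      # every match counts as an interaction
    return boundary_count, interaction_count
```

Filtering and counting happen in the same pass, and no set of previously seen user IDs needs to be kept in memory because a user's events never reappear once the group boundary has been crossed.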

For sub-partitions of a shard (where applicable), the final counts as generated across all sub-partitions of that shard can simply be summed to give local user and interaction counts for that shard, with no risk of users being double-counted in the user count. Across all shards, the local counts for those shards can, likewise, simply be summed to give total user and interaction counts across the system 202 as a whole, again with no risk of users being double-counted.

It is noted that the process of FIG. 5 is exemplary, and optimizations and extensions are envisaged that remain within the scope of the present invention. For example, in some contexts, it may be possible to skip through events for certain users, where it has already been determined that those users do not have the desired attributes, to optimize the process.

What is more, the process can be readily extended to more than two variables. That is, additional counters may be used to increase the level of information extracted.

For example, in the case of a query filter F and a request that the results of that filtering be broken down into multiple buckets B1, . . . , BM (that is, with F defining a subset in the index, with additional filtering of the subset to provide the breakdown), respective counters can be maintained for each bucket, to provide a count for every bucket in a single iteration of the process. In that case, separate interaction and boundary counters can be updated for each bucket, in addition to overall counts for the subset defined by F if desired.
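One way such per-bucket counters might be maintained in a single pass is sketched below. The dictionary-based event records and the `bucket_of` callback are hypothetical, introduced only for illustration.

```python
from collections import defaultdict
from itertools import groupby
from typing import Callable, Dict, Hashable, Iterable, Tuple


def count_by_bucket(events: Iterable[dict],
                    matches: Callable[[dict], bool],
                    bucket_of: Callable[[dict], Hashable]
                    ) -> Tuple[Dict[Hashable, int], Dict[Hashable, int]]:
    """Return (user counts, interaction counts) per bucket for one partition.

    Assumes events for the same user ID are contiguous in `events`.
    """
    user_counts: Dict[Hashable, int] = defaultdict(int)
    interaction_counts: Dict[Hashable, int] = defaultdict(int)
    for _user_id, group in groupby(events, key=lambda e: e["user_id"]):
        buckets_seen = set()  # buckets in which this user is already counted
        for event in group:
            if not matches(event):
                continue
            bucket = bucket_of(event)
            interaction_counts[bucket] += 1
            if bucket not in buckets_seen:
                buckets_seen.add(bucket)
                user_counts[bucket] += 1
    return dict(user_counts), dict(interaction_counts)
```

The only per-user state is the small set of buckets already credited to the current user, which is discarded at each group boundary.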

Privacy Policy:

As indicated above, to implement a robust privacy policy, various privacy controls are placed on the output of information from the content processing system to ensure that the users' privacy is preserved at all times. The goal is to prevent the identification of individual users in the results and also prevent the disclosure of user characteristics that can be attributed to an individual user (for example, to prevent the results from being used to infer the age, gender or other user attribute of a particular social media user).

These privacy controls will now be described in greater detail, along with explanations of various attack vectors that these constraints render ineffective. As will become apparent in view of the following, not only are these constraints highly effective, protecting against a number of attack vectors, they are also fast to apply and easy for users to understand. They can also be updated dynamically, i.e. “on-the-fly”, without downtime.

Some important aspects of the privacy policy can be summarized as follows:

-   Quantization of results and minimum audience size: to prevent disclosing tiny audiences within larger audiences by doing set algebra across query results;
-   Redaction controls: minimum number of unique authors for overall audience and each data bucket;
-   Jitter/noise: prevent other attacks by slightly “fuzzing” the results before quantization where necessary, or otherwise ensuring that the counts are generated with sufficient error before quantization (see below).

In addition, each user event is purged from the index upon expiry of a retention period for that user event (e.g. ~30 days from a timing of that event), whereby user events are not counted once purged. Moreover, user events for users who are below an age threshold of 18 are never counted.

FIG. 7 shows a schematic high-level overview illustrating certain underlying principles relating to audience-size gating, quantization and redaction.

Block 702 represents all events in an index on which a query is to be run. That is, all events that satisfy an interaction filter (˜606, FIG. 1A) used to build that index. As noted, a user can run a query on a subset of the events 702 in the index. For example, the user may wish to restrict the parameters of the query to certain topics, or to certain user demographics (age, gender, job function etc.). By way of example, first and second subsets 704 a, 704 b for first and second queries are shown defined by first and second query filters respectively (˜626 a, FIG. 1B).

Audience-Size Gating:

In accordance with the privacy policy, queries are only permitted on subsets of events having at least a minimum number of unique users. That is, where the overall unique user count for events satisfying the query filter is at least as great as a minimum audience-size gating threshold S. This can be a fixed threshold, such as S=1000 users; in which case, any query filter for which the overall unique user count is less than this will be rejected outright, with no information returned at all for that query filter. In FIG. 7, the unique user count across the second subset of events 704 b is below the minimum audience-size gating threshold, therefore the second query is rejected outright (block 706) with no information about that second subset 704 b being released. However, the unique user count across the first subset of events 704 a exceeds the minimum audience-size gating threshold, therefore the first query is accepted (block 708). Consequently, a set of results 709 is generated in response to the first query for releasing to the user.

By way of example, FIG. 7A shows how a query filter for “Pepsi lovers in Alameda” is rejected because the overall number of unique users within the index who satisfy that query filter (˜second subset 704 b) is below the minimum audience-size gating threshold S. However, a query filter for “Pepsi lovers in San Francisco, Oakland or Alameda” is accepted because a greater number of users in the index satisfy that filter (˜first subset 704 a), to the extent that the overall unique user count for that query exceeds the minimum audience-size gating threshold S.
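The gating check itself is a single comparison made before any result is computed for release. The following is a minimal sketch; the function name, the callback and the use of `None` to signal rejection are assumptions made for illustration.

```python
from typing import Callable, Optional

MIN_AUDIENCE_SIZE = 1000  # example value of the gating threshold S


def gated_query(unique_user_count: int,
                compute_results: Callable[[], dict],
                threshold: int = MIN_AUDIENCE_SIZE) -> Optional[dict]:
    """Run the query only if the audience is at least `threshold` unique users.

    Returns None for a rejected query, so that no information about the
    subset is released.
    """
    if unique_user_count < threshold:
        return None
    return compute_results()
```

Passing the results as a callback means nothing is computed, and so nothing can leak, for a query that fails the gate.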

Quantization:

For accepted queries, the results are also quantized and selectively redacted before any aggregated information is released responsive to a query. Quantizing implies that when counts are organized in buckets (˜656.1-4, FIG. 1B), they are ‘rounded’ before appearing in released information. Moreover, if a bucket has a user count below a certain number, this bucket is redacted entirely such that neither the user count nor the interaction count for that bucket is released. In some cases, the existence of a redacted bucket may not be revealed at all. In such cases, the bucket name is also redacted as well as the counts: merely knowing that a certain bucket exists (albeit in volumes lower than the redaction threshold) could represent by itself a leak of information, in certain cases, especially when the full list of possible values for a certain field is not known a priori.

Returning to FIG. 7, for queries that are accepted, quantization and redaction are applied to the resulting user counts and interaction counts (block 710, FIG. 7). In this example, in response to the first query, a breakdown of the results for the first subset of events 704 a is provided, for example by age, gender, job function, topic or combinations thereof etc. That is, individual counts are provided for narrower subsets of the events 704 a (buckets), each defined by a bucket filter (˜626 b, FIG. 1B) applied on top of the query filter. By way of example, FIG. 7 shows three buckets within the subset of events 704 a.

The user and interaction counts for each bucket can for example bequantized by rounding each of them down to the nearest multiple of achosen integer ΔQ. That is, such that any count released to a customeris always a matching one of a set of quantized values:

{Q _(n) }={Q ₀=0, Q ₁ =ΔQ, Q ₂=2*ΔQ, Q ₃=3*ΔQ, Q ₄=4*ΔQ, . . . }

where those values Q_(n) are referred to herein as “quantizationboundaries”. In this context ΔQ constitutes a fixed “quantization range”of the quantization, i.e. the difference between adjacent pairs ofquantization boundaries, which is fixed in that it is the same for alladjacent pairs of quantization boundaries. That is, Q_(n+1)−Q_(n)=ΔQ forall n:

-   Q₁−Q₀=ΔQ
-   Q₂−Q₁=ΔQ
-   Q₃−Q₂=ΔQ
-   Q₄−Q₃=ΔQ
-   etc.

The set of results in FIG. 7 is also shown to include an overall unique user count and overall interaction count for the subset of events 704 a as a whole. These are also quantized in the same way.

The un-quantized (original) counts are never released by the system. Only the quantized counts are ever released to customers, and those are only released when they are not less than a redaction threshold.
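By way of illustration only, the rounding-down quantization described above can be sketched as follows. The function name `quantize` and the default ΔQ of 100 are assumptions made for this sketch and are not taken from the described system.

```python
def quantize(count: int, delta_q: int = 100) -> int:
    """Round a non-negative count down to the nearest multiple of delta_q.

    Hypothetical helper: delta_q plays the role of the quantization range.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    if delta_q <= 0:
        raise ValueError("delta_q must be a positive integer")
    # Integer division discards the remainder, so the result is always
    # one of the quantization boundaries 0, delta_q, 2*delta_q, ...
    return (count // delta_q) * delta_q


if __name__ == "__main__":
    for raw in (99, 100, 358434):
        print(raw, "->", quantize(raw))
```

With ΔQ=100, any two raw counts lying between the same pair of adjacent quantization boundaries are released as the same value.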

Redaction:

Where the unique user count for any given bucket is less than a redaction threshold R, the results for that bucket are “redacted”, i.e. neither a unique user count nor an interaction count is released for that bucket. That is, apart from the fact that it may be possible, in certain circumstances, to infer from the results that there are fewer than R unique users for that bucket, all information for that bucket is withheld. For convenience, R=ΔQ may be chosen, so that any bucket with a quantized user count of zero is redacted. However, this is not essential; for example, a higher redaction threshold could be set.

The fact that the result has been redacted for a particular bucket may be explicitly indicated in the results set 709 (as in FIG. 7 for bucket 2), or both the user count and the interaction count may simply be given as zero. Alternatively, that bucket may simply be omitted from the results set 709 altogether, such that it is not evident from the results set 709 that this bucket has even been considered. For example, where a customer requests a breakdown by topic, buckets for topics that do not reach the redaction threshold may simply never be revealed to the customer.

FIG. 7B shows a specific example, where the set of results 709 is broken down according to age brackets. That is, where the buckets correspond to user age brackets. For each age bracket, (non-quantized) user and interaction counts are generated within the system, but never released (note only one count is shown for each bucket in FIG. 7B). First these are quantized or, for any age bracket whose unique user count is below the redaction threshold R, redacted, such that neither the unique user count nor the interaction count for that age bracket is released.
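A minimal sketch of per-bucket redaction followed by quantization is given below, purely as an illustration. The function name `release_buckets`, the use of `None` as an explicit redaction marker and the `omit_redacted` flag are assumptions of this sketch; the bucket "13-17" in the usage example is likewise hypothetical.

```python
from typing import Dict, Optional, Tuple

Counts = Tuple[int, int]  # (unique user count, interaction count)


def release_buckets(
    raw: Dict[str, Counts],
    delta_q: int = 100,
    redaction_threshold: int = 100,
    omit_redacted: bool = False,
) -> Dict[str, Optional[Counts]]:
    """Quantize each bucket, redacting buckets with too few unique users."""
    released: Dict[str, Optional[Counts]] = {}
    for name, (users, interactions) in raw.items():
        if users < redaction_threshold:
            # Neither count is released; optionally hide the bucket itself.
            if not omit_redacted:
                released[name] = None
            continue
        released[name] = (
            (users // delta_q) * delta_q,
            (interactions // delta_q) * delta_q,
        )
    return released


if __name__ == "__main__":
    raw_counts = {"18-24": (1104752, 1214998), "13-17": (42, 650)}
    print(release_buckets(raw_counts))
    print(release_buckets(raw_counts, omit_redacted=True))
```

Setting `omit_redacted=True` corresponds to the option in which the existence of the redacted bucket is not revealed at all.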

Attack Scenario Walkthroughs

Homogeneous Group Attack

Minimum audience-size gating (threshold S) ensures only audiences over a certain size can be analysed. This makes it more difficult for a malicious user to execute a “homogeneous group search attack”, in which a sensitive attribute of an individual is inferred through their membership of a group of people all sharing the same attribute. This is because it is very difficult to arrange for homogeneity in larger audiences, especially when analyzing their organic activity rather than the static audience.

Set Balancing Attack

In the face of the constraint that only buckets with more than 100 authors are made visible, set balancing attacks attempt to retrieve a demographic attribute of an individual by comparing results from multiple larger audiences.

Suppose an attacker knows that a specific man tweeted on a unique hashtag on a specific day and wants to discover his age. He attempts this by running two queries:

-   Query A: age breakdown of all women.
-   Query B: age breakdown of the audience consisting of all women plus men tweeting on the unique hashtag.

The age breakdown of all women serves as “padding”: since the audience of all women is large, he expects to get results for both queries, so he compares the counts given in A and B for each age bracket, expecting one of the buckets to have a higher count, likely corresponding to the small group of users the individual man belongs to. The padding set is usually selected so that it is mutually exclusive with the small set under observation (i.e. the two sets have no overlap, so the set difference operation is straightforward): all women vs. a specific man in this example. For example, without quantization, the results he gets back might look as follows:

         Query A                           Query B
         user count   interaction count   user count   interaction count
18-24    1104752      1214998             1104752      1214998
25-34    927150       1029488             927150       1029488
35-44    669702       693816              669702       693816
45-54    358434       419349              358435       419350
55-64    203391       221958              203391       221958
65+      112265       124850              112265       124850

As is evident, the man in question shows up as an extra user in the 45-54 age bracket, and his tweet shows up as an extra interaction for that bracket. Consequently, the attacker can infer that the man's age is 45-54.

However, with quantization of results, in 99% of cases queries A and B will return the same quantized counts in this scenario. For instance, in the above example, each count is rounded down to the nearest hundred before release, completely obscuring the man within the results:

         Query A                                       Query B
         quantized     quantized                       quantized     quantized
         user count    interaction count               user count    interaction count
18-24    1104700       1214900                         1104700       1214900
25-34    927100        1029400                         927100        1029400
35-44    669700        693800                          669700        693800
45-54    358400        419300                          358400        419300
55-64    203300        221900                          203300        221900
65+      112200        124800                          112200        124800

However, where the size of the padding is 99 mod 100, a situation can still arise in which the addition of the individual can be detected, for example, assuming the un-quantized results were in fact:

         Query A                           Query B
         user count   interaction count   user count   interaction count
18-24    1104752      1214998             1104752      1214998
25-34    927150       1029488             927150       1029488
35-44    669702       693816              669702       693816
45-54    358499       419349              358500       419350
55-64    203391       221958              203391       221958
65+      112265       124850              112265       124850

In this case, the quantized results returned would be:

         Query A                                       Query B
         quantized     quantized                       quantized     quantized
         user count    interaction count               user count    interaction count
18-24    1104700       1214900                         1104700       1214900
25-34    927100        1029400                         927100        1029400
35-44    669700        693800                          669700        693800
45-54    358400        419300                          358500        419300
55-64    203300        221900                          203300        221900
65+      112200        124800                          112200        124800

In this specific instance, it would still be possible to infer the age of the man in question as 45-54 from the higher quantized user count for query B, even though the un-quantized results are never released.

As another example, assume that the man in question had tweeted on the unique hashtag three times on one day. In this case, the underlying counts might for example be:

         Query A                           Query B
         user count   interaction count   user count   interaction count
18-24    1104752      1214998             1104752      1214998
25-34    927150       1029488             927150       1029488
35-44    669702       693816              669702       693816
45-54    358434       419397              358435       419400
55-64    203391       221958              203391       221958
65+      112265       124850              112265       124850

This gives quantized counts of:

         Query A                                       Query B
         quantized     quantized                       quantized     quantized
         user count    interaction count               user count    interaction count
18-24    1104700       1214900                         1104700       1214900
25-34    927100        1029400                         927100        1029400
35-44    669700        693800                          669700        693800
45-54    358400        419300                          358400        419400
55-64    203300        221900                          203300        221900
65+      112200        124800                          112200        124800

Again, it becomes possible in this case to infer the man's age bracket, this time from the higher quantized interaction count on query B.
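The boundary effect in the two walkthroughs above can be reproduced with a short sketch. It assumes, for illustration only, ΔQ=100 and exact (error-free) counts; the helper names `quantize` and `leaks` are hypothetical.

```python
def quantize(count: int, delta_q: int = 100) -> int:
    """Round a count down to the nearest multiple of delta_q."""
    return (count // delta_q) * delta_q


def leaks(padding_count: int, added: int = 1, delta_q: int = 100) -> bool:
    """True if adding `added` to the padding changes the released count."""
    return quantize(padding_count + added, delta_q) != quantize(padding_count, delta_q)


if __name__ == "__main__":
    # User counts: one extra user is visible only when the padding is 99 mod 100.
    print(leaks(358434))  # padding well inside a quantization range
    print(leaks(358499))  # padding sitting just below a boundary

    # Of 100 consecutive padding sizes, list those that expose one extra user.
    print([p for p in range(358400, 358500) if leaks(p)])

    # Interaction counts: three tweets cross the boundary from 419397.
    print(leaks(419397, added=3))
```

Under these assumptions exactly one padding size in every hundred exposes a single added user, which is consistent with the 99% figure given above.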

In order to take advantage of this, a malicious user would have to be able to precisely control the size of the “spectator” audience (i.e. the padding), and the nature of the content processing system makes this very difficult. In the context of the described system, the “group of all women” is actually all women who are actively publishing/consuming content in the given time period, i.e. it is not just based on the demographic properties of the member base, but on their activity. Such activity is difficult to control, making a set balancing attack on the quantization boundaries very difficult to engineer. Nevertheless, despite the inherent difficulty in exploiting this, it is still a potential point of vulnerability that cannot be ignored.

However, it is also noted that, in the walkthrough of the attack above, it is implicitly assumed that the counts are error free.

When HLL or other probabilistic counting methods are used, this is not the case: for example, the audience size counts, before being quantized, inherently include some random noise, and are therefore only approximate counts with an intrinsic error. The noise is intrinsic to the implementation of the unique author counting algorithm, which relies on probabilistic data structures, rather than individually counting user events. The same would apply to a probabilistic estimation of the interaction counts.

As context, it is useful to consider some of the principles underlying HLL and its variants. At its core, HLL exploits the following observation: when a set of values is hashed, that is, a hash function is applied to each value to generate a fixed-length bit sequence (hash value) where the probability of each bit being a 1 or 0 is more or less 50/50 and independent of the other bits, then, for a large enough data set, the resulting hash values will look like:

form of hash value     expected percentage of this form    probability
1xxxxxxxxx             ~50%                                 probability of the first bit being a 1 is ~1/2
01xxxxxxxx . . .       ~25%                                 (1/2)^2
001xxxxxxx . . .       ~12.5%                               (1/2)^3
0001xxxxxx . . .       ~6.25%                               (1/2)^4
00001xxxxx . . .       ~3.125%                              (1/2)^5
. . .                  . . .                                . . .
0000000001 . . .       ~0.1%                                (1/2)^10

Now, looking at this the other way round and in very high-level terms, when, say, there is a single hash value of the form 0000000001xx observed in a set of hash values computed in this way, chances are there are ˜1000 unique values in the original set [0.1% of 1000=1 such hash value expected]. Applied to events, each event can be hashed, and, in essence, all the system needs to keep track of is the longest run of 0's observed so far in the hashed values to estimate the number of unique events observed, without having to count those events individually. HLL builds on this very coarse approximation in order to reduce the margin of error, but the basic premise is the same.
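The coarse premise described above (not HLL itself, which refines it considerably) can be sketched as follows. The choice of SHA-256 truncated to 32 bits, and the names `leading_zeros` and `rough_cardinality`, are assumptions made for this illustration only.

```python
import hashlib
from typing import Iterable

HASH_BITS = 32  # assumed hash width for this sketch


def leading_zeros(value: str) -> int:
    """Number of leading 0 bits in a 32-bit hash of the value."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    hashed = int.from_bytes(digest[:4], "big")
    return HASH_BITS - hashed.bit_length()


def rough_cardinality(values: Iterable[str]) -> int:
    """Very coarse estimate of the number of distinct values.

    A hash with k leading zeros followed by a 1 occurs with probability
    (1/2)^(k+1), so observing one suggests roughly 2^(k+1) distinct values.
    """
    longest = -1
    for value in values:
        longest = max(longest, leading_zeros(value))
    if longest < 0:
        return 0  # no values observed
    return 2 ** (longest + 1)


if __name__ == "__main__":
    user_ids = [f"user-{i}" for i in range(1000)]
    print(rough_cardinality(user_ids))
    # Duplicates hash identically, so they do not change the estimate.
    print(rough_cardinality(user_ids + user_ids))
```

Only a single small integer is retained per estimate, which is why the result is an approximation with an intrinsic error rather than an exact count.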

The inventors of the present invention have recognized that, with such probabilistic estimation techniques, the intrinsic errors in the result counts can be tuned to counter set-balancing attacks on the quantization boundaries. That is, the intrinsic error can be tuned so that it is sufficiently large that it is not possible for an attacker to attribute variations between the quantized counts for a given bucket in queries A and B to the activity of an individual user as opposed to the intrinsic error. HLL and its variants have an error rate that depends on the cardinality of possible IDs in the set and the amount of memory that is allocated to each HLL object (and the error rate can be tuned by tuning the memory allocation).

It is noted that HLL and its variants are just some examples. Other probabilistic procedures can also be used to generate an approximate count. Generally, with probabilistic counting, it may be sufficient to rely on the intrinsic error to provide the necessary protection provided it is of sufficient magnitude, i.e. at least 2-3%. This may need some tuning to ensure a sufficient error.

However, the inventors of the present invention have also recognized that, in moving to an exact counting procedure of the kind described above, in which user events satisfying a filter are individually identified and counted instead, the intrinsic error is lost. Without something to replace it, the system would become more vulnerable to this form of set-balancing attack. This applies both to the exact user counting procedure and the exact interaction counting procedure.

To prevent this, where an exact counting procedure is used, somewhat counterintuitively, a small randomized “jitter” (artificial, random error) is applied within the content processing system 202 to introduce some deliberate error into the otherwise-exact user and interaction counts before they are quantized. That is, artificial error data is randomly generated and used to introduce a deliberate error into the otherwise-exact counting procedure (i.e. which is exact but for that error). The result is an inexact (approximate) count deviating, by a random amount, from the exact count that would be obtained were it not for this error data. The approximate count is close enough to the actual value to be useful, but not exact, so as to protect the privacy of individuals (of course, from time to time an approximate count might happen to equal the exact count, where a randomly applied deviation happens to be zero; what matters is that this is not guaranteed, such that a range of different randomized deviations is exhibited across multiple counts to guarantee uncertainty).

For example, a plus/minus offset may be randomly generated for each of the counts (overall user and interaction counts) and applied to that count.

It may be convenient to compute the exact count first, and then apply the error to this exact count. This way, an error value can be randomly selected from a percentage range of the total count; for example, an error of around ±2% of the exact count is expected to be sufficient, though this is context dependent and may vary depending on audience size. In general, the error in the count is small compared with the count itself to ensure that the count is still a useful measure, but large enough to mask individual users and their activity near the quantization boundaries. Striking a suitable balance will be context dependent, and can be achieved by way of routine design procedure in view of the teaching presented herein.

This error can be introduced at any stage in the counting process. For example, in the architecture of FIG. 4, the jitter can be applied directly to the count across all nodes, by the total count generator 308, but it could also be applied to the local counts at the nodes 302. In any event, the upshot is that the content processing system 202 generates inexact user and interaction counts for each bucket, which correspond to the exact user and interaction counts across the events adjusted intentionally by a randomly selected amount.

Such error data is generated individually for each count, to ensure that they exhibit a range of different errors across the set of generated counts.
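As a non-limiting sketch, a random plus/minus offset of up to about ±2% can be applied to an exact count before quantization as shown below. The function names, the uniform distribution of the offset and the use of Python's `random.Random` are assumptions of this sketch; a deployed system might prefer a cryptographically strong source of randomness.

```python
import random
from typing import Optional


def jittered_count(
    exact_count: int,
    max_relative_error: float = 0.02,
    rng: Optional[random.Random] = None,
) -> int:
    """Return the exact count plus a randomly selected plus/minus offset."""
    if rng is None:
        rng = random.Random()
    bound = int(exact_count * max_relative_error)
    # A fresh offset is drawn for every count, so different counts
    # exhibit a range of different errors.
    offset = rng.randint(-bound, bound)
    return max(0, exact_count + offset)


def release(exact_count: int, delta_q: int = 100,
            rng: Optional[random.Random] = None) -> int:
    """Apply jitter first, then round down to a multiple of delta_q."""
    noisy = jittered_count(exact_count, rng=rng)
    return (noisy // delta_q) * delta_q


if __name__ == "__main__":
    generator = random.Random()
    print(release(358499, rng=generator), release(358500, rng=generator))
```

Because the offset is typically much larger than one user but small relative to the count, a difference between two released values can no longer be attributed to a single individual.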

As noted, the term “random” herein is used to mean unpredictable in the sense set out above. In the specific case of HLL, it is noted that intrinsic HLL jitter is deterministic in that it always gives the same result when the same set of user IDs is fed into it, regardless of the order in which, or the frequencies (>0 of course) with which, each user appears. This is one benefit of HLL: if the same query is run twice, the same results are returned twice for as long as the number of unique users remains unchanged. However, the jitter is still unpredictable across different queries, hence HLL jitter is still considered random/unpredictable as those terms are used herein.

In general, applying the same jitter for duplicate queries may be preferred in some contexts, as it can provide more consistent and therefore intuitive results. What matters is that the jitter is unpredictable across different queries, as this is what provides the desired protection.
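One conceivable way of obtaining jitter that repeats for duplicate queries, yet is unpredictable across different queries, is to derive it from a keyed hash of the query. The following is a sketch under that assumption only; the secret key, the `count_label` argument and the function name `jitter_for_query` are hypothetical and are not described in the present disclosure.

```python
import hashlib
import hmac


def jitter_for_query(
    secret_key: bytes,
    query_text: str,
    count_label: str,
    exact_count: int,
    max_relative_error: float = 0.02,
) -> int:
    """Apply a repeatable pseudo-random offset keyed on the query and count."""
    message = f"{query_text}\x00{count_label}".encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).digest()
    # Map the first 8 bytes of the MAC to a fraction in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    bound = int(exact_count * max_relative_error)
    offset = round((2 * fraction - 1) * bound)
    return max(0, exact_count + offset)


if __name__ == "__main__":
    key = b"example-only-secret"
    first = jitter_for_query(key, "age breakdown of all women", "45-54:users", 358499)
    second = jitter_for_query(key, "age breakdown of all women", "45-54:users", 358499)
    print(first == second)
```

Without knowledge of the key, the offset applied to a different query or a different count cannot be predicted from previously released results.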

Metering Strategic Populations

In the context of social media, “metering” herein refers to an attempt to obtain or extrapolate information about activity across the platform as a whole, by running very broad queries. An extreme case would be attempting to obtain information about the activity of all users of the platform, for example an interaction count for all of the social interactions on the platform of a particular type (e.g. likes, shares, posts, tweets, re-tweets etc.) or a user count for all active users (in any demographic) in a given month.

There are various approaches for preventing this type of metering.

The operating model can be restricted to a “private recording” (i.e. bespoke index; see above) model, where daily volumes are capped. That is, where each customer is only permitted to record up to a certain number of events in any of his indexes on any given day. For large social media platforms, this may be sufficient, for example where the cap may in practice correspond to, say, less than 1% of the overall daily activity on the platform. However, for smaller platforms where the daily caps represent a much higher proportion of daily activity, additional steps may be needed to prevent metering. Moreover, restricting to a private operating model may be undesirable in some circumstances.

Metering may also be prohibited contractually, and it can be monitored for through analysis of the queries submitted. All usage can be tracked to a company or other customer, via the account used to submit the queries. However, this is not necessarily straightforward, and the monitoring requires dedicated resources and is potentially error prone.

Therefore, in the case of a shared index, in which (say) all of the events are held and queries can be run by any customer on the shared index, or smaller platforms where the daily caps on private bespoke indexes may not be sufficient alone, additional technical constraints may be imposed to prevent metering by automatically rejecting queries that are too broad.

A simple and effective way of achieving this is to return results only if the total audience size is smaller than a maximum permitted number of users (set by a metering threshold). In this case, as illustrated in FIG. 8, the overall audience size must not be less than the minimum audience-size gating threshold, but must also not exceed the maximum number of permitted users.

The question of where the maximum threshold should be set is context dependent. As a very general rule of thumb, it is expected that an upper limit of ˜5% of the total number of active users on the platform as a whole within a relevant time period (global unique user count) would be sufficient, but this may vary depending on the circumstances. A statistical analysis of network traffic for the social media platform, for example a statistical analysis of the events received from the platform, may be performed to determine a suitable upper limit. Here, the aim is to set a threshold such that the maximum sample size is too small to draw any statistically significant inferences about activity across the platform as a whole.

Note that computing a unique user count for such queries is intensive, as for broad queries that count can be very large: in order to determine whether or not the query breaches the maximum audience-size gate (which could be millions of users for large platforms), at least that many events need to be counted before the system can decide to reject the query. In the case that the query is ultimately rejected, a significant amount of computing resources is “wasted” in order to reach that decision.

Therefore, as an optimization, it may be preferable, in determining whether to accept or reject a query, to initially estimate the unique user count from a representative sample of the events in the index. That is, by filtering and counting only a representative subset of events in the index. This estimated count may have a relatively large margin of error, but one which still represents a small percentage of the maximum audience size. This allows the decision of whether to reject the query on the ground of metering to be made more efficiently.

If the maximum cap is not exceeded, then a more accurate count can be determined from the index as a whole in the manner described above. This can then be used to check whether the minimum audience-size gating constraint is also met (the estimated count would generally not be suitable for this, because the error will be a much larger percentage of the much lower gating threshold), and if so the query can be accepted and responded to as set out above.
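The two-stage decision described above may be sketched as follows, by way of example only. The sketch assumes that the representative sample is drawn by selecting a fixed fraction of user identifiers via a hash, so that the sampled unique user count can simply be scaled up; this sampling scheme, the event layout (a dict with a "user_id" key) and all names are assumptions of the sketch rather than features of the described system.

```python
import hashlib
from typing import Callable, Dict, Sequence

Event = Dict[str, str]


def in_sample(user_id: str, sample_rate: int) -> bool:
    """Select roughly one in `sample_rate` users, deterministically."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % sample_rate == 0


def decide(
    events: Sequence[Event],
    matches: Callable[[Event], bool],
    min_users: int,
    max_users: int,
    sample_rate: int = 10,
) -> str:
    """Cheap metering check on a sample, then a full minimum-size check."""
    sampled_users = {
        e["user_id"] for e in events
        if in_sample(e["user_id"], sample_rate) and matches(e)
    }
    estimate = len(sampled_users) * sample_rate
    if estimate > max_users:
        return "reject: metering"

    # Only queries that pass the metering check pay for the full count.
    exact_users = {e["user_id"] for e in events if matches(e)}
    if len(exact_users) < min_users:
        return "reject: audience too small"
    return "accept"


if __name__ == "__main__":
    index = [{"user_id": f"u{i}", "topic": "cars"} for i in range(50)]
    print(decide(index, lambda e: e["topic"] == "cars", min_users=10, max_users=1000))
```

The estimate is only used against the (large) maximum cap; the minimum audience-size gate is checked against the full count, for the reason given above.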

Variable Quantization/Redaction Parameters

The minimum audience-size gate S, quantization range ΔQ and redaction threshold R are individually configurable system parameters of the content processing system 202.

In the above examples, for convenience, S, ΔQ and R are fixed at 1000, 100 and 100 respectively. Whilst this is a viable approach, and preferred in some contexts for its simplicity and transparency, there are circumstances in which it may be desirable to vary these parameters.

For example, it may be desirable for the quantization range ΔQ to vary with the quantized values, where the notation ΔQ(*) is used to represent the variable quantization range. That is, where each count is quantized to one of a set of quantized values {Q_(n)}={Q₀, Q₁, Q₂, Q₃, Q₄, . . . } where:

-   ΔQ(m)≠ΔQ(m−1) for at least one m, with
-   ΔQ(n):=Q_(n+1)−Q_(n) being the difference between the n'th and [n+1]'th quantization boundaries.

For example, it may be that ΔQ(n) increases linearly with n (at least approximately), so that the quantization error remains an approximately fixed percentage of the count being quantized for different sizes of count.
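A variable quantization range can be illustrated with an explicit list of quantization boundaries whose spacing grows with the size of the count. The particular boundaries below (steps of 100, then 1,000, then 10,000, ending at 1,000,000) are purely hypothetical, as are the names `BOUNDARIES` and `quantize_variable`; the stepwise growth is only one possible approximation to a range that increases with n.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical boundaries: the spacing widens as the counts grow, so the
# quantization error stays a bounded fraction of the count.
BOUNDARIES = (
    [0]
    + list(range(100, 10_000, 100))
    + list(range(10_000, 100_000, 1_000))
    + list(range(100_000, 1_000_001, 10_000))
)


def quantize_variable(count: int, boundaries: Sequence[int] = BOUNDARIES) -> int:
    """Round a count down to the nearest boundary in a sorted list."""
    if count < 0:
        raise ValueError("count must be non-negative")
    # bisect_right gives the number of boundaries <= count.
    return boundaries[bisect_right(boundaries, count) - 1]


if __name__ == "__main__":
    for raw in (99, 9_999, 12_345, 358_434):
        print(raw, "->", quantize_variable(raw))
```

In this sketch, counts above the last boundary are released as the last boundary; a real configuration would extend the list as far as needed.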

The audience-size gate S for a given query may also be set in dependence on the query itself. That is, with different minimum audience-size gates for different queries, depending on the parameter(s) of those queries. For example, it may be desirable to set a higher minimum audience-size gating threshold for certain audiences, which can be achieved by setting the gating threshold in dependence on a user attribute(s) defined in or derived from the query. This also applies to the individual buckets' redaction thresholds and to the bucket sizes (quantization ranges), which can also be set in dependence on the query itself, for example as a function of the user attribute(s) and/or other query parameter(s).

In embodiments, the exact counting procedure can be implemented as follows.

A content processing system for processing content items (interaction events) of a content publication platform having a plurality of users may be provided, the content processing system comprising: a content input configured to receive content items of the content publication platform, each of which relates to a piece of published content and is associated with a user identifier of one of the users who has published or consumed that piece of content; a plurality of content processors for processing the content items; a content allocator configured to allocate the content items to the content processors based on the user identifiers associated with the content items; and a total count generator; wherein each of the user identifiers is assigned to one of the content processors, and the content allocator is configured to allocate all of the content items associated with that user identifier to that same content processor; wherein each of the content processors is configured to generate, from the content items allocated to it, a local user count indicating a number of unique user identifiers associated with those content items, wherein the total count generator is configured to generate, by summing the local user counts from all of the content processors, a total user count indicating a total number of unique users of the content publication platform.

Allocating the content items in this way allows the local and total user counts to be generated extremely quickly and efficiently, as explained in detail later.

At each of the content processors, the content items allocated to that content processor may be stored in local storage of that content processor.

The content items may be grouped in the local storage according to the user identifiers associated with those content items, with one group per user identifier.

The content processing system may comprise a filter coordinator configured to instruct each of the content processors to apply a filter, thereby causing each of the content processors to filter the content items allocated to it according to the filter and generate its local user count from the filtered content items, wherein the local user count indicates a number of unique users for the content items allocated to that content processor that satisfy the filter, whereby the total user count indicates a total number of unique users who satisfy the filter.

In the case that the content items are stored and grouped locally, each of the content processors may be configured to repeatedly apply a boundary counting procedure to the grouped content items to selectively increment a local boundary count as follows: applying the boundary counting procedure for an initial one of the groups by: for each of the items in that group, determining whether that content item satisfies the filter, incrementing the local boundary count only if at least one of those content items satisfied the filter, and repeating the boundary counting procedure for a next one of the groups; wherein the boundary counting procedure terminates once it has been applied for all of the user identifiers assigned to that content processor, wherein that content processor's local user count comprises or is derived from its local boundary count after the termination of the boundary counting procedure.
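The allocation, grouping and boundary counting described above can be sketched in a single process as follows. The hash-based assignment of user identifiers to processors, the dict-based event layout and all function names are assumptions of this sketch; in the described system the content processors would be separate threads, cores or servers.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

Event = Dict[str, str]
Store = Dict[str, List[Event]]  # user identifier -> that user's group of events


def processor_for(user_id: str, num_processors: int) -> int:
    """Assign every user identifier to exactly one processor."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_processors


def allocate(events: Iterable[Event], num_processors: int) -> List[Store]:
    """Route each event to its user's processor, grouped per user."""
    stores: List[Store] = [defaultdict(list) for _ in range(num_processors)]
    for event in events:
        user_id = event["user_id"]
        stores[processor_for(user_id, num_processors)][user_id].append(event)
    return stores


def local_user_count(store: Store, matches: Callable[[Event], bool]) -> int:
    """Increment once per group having at least one matching event."""
    boundary_count = 0
    for group in store.values():
        if any(matches(event) for event in group):
            boundary_count += 1
    return boundary_count


def total_user_count(stores: List[Store], matches: Callable[[Event], bool]) -> int:
    """No user spans two processors, so local counts can simply be summed."""
    return sum(local_user_count(store, matches) for store in stores)


if __name__ == "__main__":
    events = [
        {"user_id": "a", "topic": "cars"},
        {"user_id": "a", "topic": "cars"},
        {"user_id": "b", "topic": "food"},
        {"user_id": "c", "topic": "cars"},
    ]
    stores = allocate(events, num_processors=3)
    print(total_user_count(stores, lambda e: e["topic"] == "cars"))
```

Because all events of a given user reside on one processor, no de-duplication across processors is needed and the summed total is exact.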

Each of the user identifiers may be associated with a set of user attributes, and the content items may be filtered based on the user attributes associated with the user identifiers; or at least some of the content items may comprise metadata (e.g. at least some of which may be derived from the pieces of content) and the content items may be filtered based on the metadata in the content items; or the content items may be filtered based on a combination of the user attributes and the metadata.

The content processing system may comprise a query handler for handling queries submitted to the content processing system, and the query handler may be configured to respond to a submitted query with a result comprising or derived from the total user count.

The filter may be defined in the query and the filter coordinator may instruct the content processors to apply the defined filter in response to the submission of the query.

Each of the content items may comprise a copy of the user identifier with which it is associated, thereby associating that content item with that user identifier.

Each of the content processors may be a processing unit or a thread. For example, each of the content processing units may be a CPU core thread in a server (in which case the total user count is a count for that server).

Alternatively, each of the content processors may comprise multiple processing units or multiple threads. For example, the content processing units may be servers (in which case the total user count is a count across those servers). In that case, each content processor may be configured to apply the boundary counting procedure in parallel to multiple sub-partitions of the content items allocated to it, wherein each of the user identifiers for that content processor is assigned to only one of the sub-partitions and all of the content items associated with that user identifier are in that same sub-partition, wherein the local user count for that content processor is generated by summing resulting local user counts for the sub-partitions.

To cover other applications, a system for processing user events from a platform having a plurality of users may be provided, the system comprising: an input configured to receive user events of the platform, each of which is associated with an identifier of one of the users of the platform; a plurality of event processors for processing the user events; an event allocator configured to allocate the user events to the event processors based on the user identifiers associated with the events; and a total count generator; wherein each of the user identifiers is assigned to one of the event processors, and the event allocator is configured to allocate all of the user events associated with that user identifier to that same event processor; wherein each of the event processors is configured to generate, from the user events allocated to it, a local user count indicating a number of unique user identifiers associated with those user events, wherein the total count generator is configured to generate, by summing the local user counts from all of the event processors, a total user count indicating a total number of unique users of the platform.

In this context, each event can be any event relating to the user with which it is associated. Each of the user events may relate to an action performed by or otherwise relating to one of the users of the platform and comprise an identifier of that user. That is, each of the user events may be a record of a user-related action on the platform.

Whilst these events can relate to social interactions on a social media platform (publishing/consuming content), the fourth aspect of the invention is not limited to this and the system can be used for processing other types of events and the platform can be any platform with a user base that facilitates user actions. The platform provider could for example be a telecoms operator like Vodafone or Verizon, a car-hire/ride-share platform like Uber, an online market place like Amazon, or a platform for managing medical records. The events can for example be records of calls, car rides, financial transactions, changes to medical records etc. conducted, arranged or performed via the platform. There are numerous scenarios in which it is beneficial to extract anonymous and aggregated information from such events, where the need to obtain a user count over a set of such events arises.

In this respect, it is noted that all description pertaining to interaction events of a social media platform (content items) herein applies equally to other types of events of platforms other than social media. Such events can comprise or be associated with user attributes and/or metadata for the actions to which they relate, allowing those events to be processed (e.g. filtered and/or aggregated) using any of the techniques described herein.

For example, the system may comprise a filter coordinator configured to instruct each of the event processors to apply a filter, thereby causing each of the event processors to filter the user events allocated to it according to the filter and generate its local user count from the filtered user events, wherein the local user count indicates a number of unique users for the user events allocated to that event processor that satisfy the filter, whereby the total user count indicates a total number of unique users who satisfy the filter.

A randomization component of the system can randomly generate error data and apply it to intentionally introduce an artificial error into the total count (either directly or in at least one of the local counts) in the manner described above.

Exact User Count

A notable feature of the filtering and counting process described above is that, in contrast to the probabilistic HLL approximation, the total user count obtained by the process is exact. Obtaining an exact user count is clearly desirable in some contexts, and in that respect is another tangible benefit of the present invention with respect to existing probabilistic methodologies, such as HLL.

However, as noted, the inventors of the present invention have recognized that in the present context, namely extracting anonymized, aggregate information, moving from an approximate user count (with an inherent error) to an exact user count could in fact open up the content processing system 202 to a specific form of attack that allows individual users to be identified in certain circumstances; hence the additional steps that are taken to prevent this.

It will be appreciated that the above embodiments have been described only by way of example. Other variations and applications of the present invention will be apparent to the person skilled in the art in view of the disclosure given herein. The present invention is not limited by the described embodiments, but only by the appended claims.

1. A method of processing user events of a platform to extract aggregate information about users of the platform, the method comprising, at an event processing system: receiving a query relating to the user events; determining at least one query parameter from the query; computing a unique user count for a set of the user events satisfying the at least one query parameter; comparing the unique user count to a metering threshold; and rejecting the query if the unique user count exceeds a maximum permitted user count indicated by the metering threshold.
2. A method according to claim 1, wherein the unique user count that is compared to the metering threshold is estimated from a representative sample of the user events in an index.
3. A method according to claim 2, wherein if the unique user count does not exceed the maximum permitted user count, the unique user count is re-computed from a larger number of the user events in the index.
4. A method according to claim 3, wherein the re-computed user count is compared with a gating threshold, wherein the query is rejected if the re-computed user count is less than a minimum permitted user count indicated by the gating threshold and accepted otherwise.
5. A method according to claim 3, wherein the unique user count is re-computed from all of the user events in the index.
6. A method according to claim 1, wherein the metering threshold is set as a function of a global unique user count for the platform.
7. A method according to claim 6, wherein the maximum permitted user count is set as a percentage of the global unique user count for the platform.
8. A method according to claim 1, wherein the metering threshold is set in dependence on a statistical analysis of the user events.
9. A method of processing user events of a platform to extract aggregate information about users of the platform, the method comprising, at an event processing system: receiving a query relating to the user events; determining at least one query parameter from the query; generating at least one count for a set of the user events satisfying the at least one query parameter; and applying quantization to the at least one count to generate at least one quantized count for release, the quantized count being one of a plurality of permitted quantized values, wherein the quantization has a variable quantization range, the quantization range being the difference between adjacent pairs of the permitted quantized values.
10. A method according to claim 9, wherein the quantization range increases for larger permitted quantized values.
11. A method according to claim 10, wherein the quantization range increases linearly with respect to the permitted quantized values.
12. A method according to claim 9, wherein the quantization range is set as a function of the at least one query parameter.
13. A method of processing user events of a platform to extract aggregate information about users of the platform, the method comprising, at an event processing system: receiving a query relating to the user events; determining at least one query parameter from the query; computing a unique user count for a set of the user events satisfying the query parameter; setting a variable release threshold for the query as a function of the at least one query parameter; and comparing the unique user count with the release threshold set for the query to determine whether to release information about the set of user events in response to the query.
14. A method according to claim 13, wherein the release threshold is a gating threshold and the query is rejected if the unique user count is less than a minimum permitted user count indicated by the gating threshold set for the query, whereby the minimum permitted user count depends on the at least one query parameter.
15. A method according to claim 14, wherein the at least one query parameter comprises a user attribute, the variable gating threshold being set as a function of the user attribute.
16. A method according to claim 15, wherein the release threshold is a redaction threshold for one of a plurality of buckets, the comparing step performed to determine whether to redact that bucket.
17. An event processing system for processing user events of a platform to extract aggregate information about users of the platform, the event processing system comprising: computer storage holding executable instructions; and one or more processing units configured to execute those instructions to carry out the following steps: receiving a query relating to the user events; determining at least one query parameter from the query; computing a unique user count for a set of the user events satisfying the at least one query parameter; comparing the unique user count to a metering threshold; and rejecting the query if the unique user count exceeds a maximum permitted user count indicated by the metering threshold.
18. An event processing system according to claim 17, wherein the unique user count that is compared to the metering threshold is estimated from a representative sample of the user events in an index.
19. An event processing system according to claim 18, wherein if the unique user count does not exceed the maximum permitted user count, the unique user count is re-computed from a larger number of the user events in the index.
20. A computer program product for processing user events of a platform to extract aggregate information about users of the platform, the computer program product comprising executable instructions stored on a computer readable storage medium and configured, when executed at an event processing system, to carry out the following steps: receiving a query relating to the user events; determining at least one query parameter from the query; computing a unique user count for a set of the user events satisfying the at least one query parameter; comparing the unique user count to a metering threshold; and rejecting the query if the unique user count exceeds a maximum permitted user count indicated by the metering threshold.
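
Purely by way of non-limiting illustration, the metering, gating and quantization steps recited above might be combined as in the following Python sketch. Every name and numeric value in it (answer_query, estimate_unique_users, permitted_values, the 50% metering fraction, the gating minimum of 100, the base step of 10) is a hypothetical choice made for the example and is not taken from the described system. The sampling rule (keeping events whose integer user_id is divisible by a modulus, then scaling up) is an assumed way of obtaining a representative sample, and chaining quantization after metering and gating is likewise an assumption of the sketch.

```python
import bisect


def unique_users(events):
    """Exact number of distinct users in a collection of events."""
    return len({e["user_id"] for e in events})


def estimate_unique_users(events, modulus=10):
    """Estimate the unique user count from a sample of the events.

    Assumption: the sample keeps events whose integer user_id is divisible
    by `modulus`; the sampled count is then scaled back up by `modulus`.
    """
    sample = [e for e in events if e["user_id"] % modulus == 0]
    return unique_users(sample) * modulus


def permitted_values(limit, base_step=10, growth_divisor=10):
    """Permitted quantized values up to at least `limit`.

    The gap between adjacent values grows roughly linearly (with integer
    rounding) with the value itself.
    """
    values = [0]
    while values[-1] < limit:
        values.append(values[-1] + base_step + values[-1] // growth_divisor)
    return values


def quantize(count):
    """Round a non-negative count down to the nearest permitted value."""
    values = permitted_values(count)
    return values[bisect.bisect_right(values, count) - 1]


def answer_query(events, query_filter, metering_fraction=0.5, gating_min=100):
    """Apply metering, then gating, then quantization to one query."""
    # Metering threshold set as a fraction of the global unique user count.
    metering_max = metering_fraction * unique_users(events)
    matching = [e for e in events if query_filter(e)]
    if estimate_unique_users(matching) > metering_max:
        return ("rejected_metering", None)  # query is too broad
    exact = unique_users(matching)  # re-computed from all matching events
    if exact < gating_min:
        return ("rejected_gating", None)  # too few users to release safely
    return ("released", quantize(exact))
```

In the sketch, a query is rejected by metering before any exact count is computed, so the more expensive re-computation only runs for queries that are not too broad.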